update the input `weight` of `_convert_weight_to_int4pack` to `[n][k / 2] uint8` (#129940)
This PR updates the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8` for CPU, CUDA, and MPS, which helps decouple the int4 model checkpoint from specific ISAs and platforms in `gpt-fast`. The advantage is that an int4 model checkpoint can be shared across different test machines without being regenerated on one particular platform. Meanwhile, the size of the input `weight` shrinks to `1 / 8` of its previous size, since each row goes from `k` 4-byte int32 values to `k / 2` single bytes.
Before this PR, the packed weight was stored in a CUDA-specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, or 8. The CPU packed weight was viewed as the same shape but stored in a different layout: `[n/64][k][32]`, dtype uint8. The weight was therefore strongly coupled to the platform (CPU/CUDA) and ISA (AVX512/AVX2/scalar): a weight generated on one ISA or platform could not be used on another, because the compute format differs when the weight is loaded onto the device. The shape arithmetic below illustrates the divergence.
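
As a quick illustration (shape arithmetic only, not library code; the concrete values of `n`, `k`, and `inner_k_tiles` are arbitrary examples), here is how the pre-PR packed shapes diverged between CUDA and CPU for the same logical `[n][k]` int4 weight:

```python
# Pre-PR packed-weight shapes for the same logical [n][k] int4 weight.
# n, k, inner_k_tiles are arbitrary example values.
n, k, inner_k_tiles = 256, 512, 8

# CUDA-specific layout: [n/8][k/(InnerKTiles*16)][32][InnerKTiles/2], dtype int32
cuda_packed_shape = (n // 8, k // (inner_k_tiles * 16), 32, inner_k_tiles // 2)

# CPU layout: viewed as the same shape, but stored as [n/64][k][32], dtype uint8
cpu_packed_shape = (n // 64, k, 32)

print(cuda_packed_shape)  # (32, 4, 32, 4)
print(cpu_packed_shape)   # (4, 512, 32)
```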

Now, we use a common serialized layout (`[n][k/2] uint8`) across devices and ISAs as the input `weight` of `_convert_weight_to_int4pack`, and each backend chooses how to interpret it as its compute layout.
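A minimal sketch of what the common serialized layout looks like, assuming a row-wise packing of two adjacent int4 values per byte; the nibble order below is an illustrative convention, not necessarily the one PyTorch uses:

```python
import torch

def pack_int4_rowwise(w_q: torch.Tensor) -> torch.Tensor:
    """Pack an [n, k] tensor of int4 values (0..15) into [n, k // 2] uint8.

    Illustrative packing only: adjacent columns are paired, with the first
    value in the high nibble and the second in the low nibble.
    """
    assert w_q.shape[-1] % 2 == 0
    w = w_q.to(torch.uint8)
    return (w[:, ::2] << 4) | (w[:, 1::2] & 0x0F)

n, k = 128, 256
w_q = torch.randint(0, 16, (n, k), dtype=torch.int32)  # already-quantized int4 values
w_uint8 = pack_int4_rowwise(w_q)                        # common layout: [n, k // 2] uint8

# Each backend then converts the common layout into its own compute layout,
# e.g. (subject to the build's device and shape constraints):
# w_int4pack = torch._convert_weight_to_int4pack(w_uint8, 2)
```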

### Performance
Intel(R) Xeon(R) CPU Max 9480, single socket (56 cores)
There is no obvious regression from this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima
diff --git a/test/test_mps.py b/test/test_mps.py
index 255d870..d5918ff 100644
--- a/test/test_mps.py
+++ b/test/test_mps.py
@@ -9181,8 +9181,10 @@
         def convert_weight_to_int4pack(b):
             b_int32, b_scales_and_zeros = _group_quantize_tensor(
-                b, n_bit=4, q_group_size=q_group
+                b.to("cpu"), n_bit=4, q_group_size=q_group
             )
+            b_int32 = b_int32.to("mps")
+            b_scales_and_zeros = b_scales_and_zeros.to("mps")
             b_int4pack = torch._convert_weight_to_int4pack(
                 b_int32, inner_k_tiles
             )