update the input `weight` of `_convert_weight_to_int4pack` to `[n][k / 2] uint8` (#129940)
This PR updates the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8` for CPU, CUDA, and MPS, which helps decouple the int4 model checkpoint from specific ISAs and platforms in `gpt-fast`. The advantage is that an int4 model checkpoint can be shared across different test machines without being regenerated on one particular platform. Meanwhile, the size of the input `weight` shrinks to `1 / 8` of its previous size, since each row goes from `k` 4-byte int32 values to `k / 2` single bytes.
Before this PR, the packed weight was stored in a CUDA-specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, or 8. The CPU packed weight was viewed as the same shape but stored in a different layout: `[n/64][k][32]`, dtype uint8. The weight was therefore strongly coupled to the platform (CPU/CUDA) and ISA (AVX512/AVX2/scalar): a weight generated on one ISA or platform could not be used on another, because the compute format differs when the weight is loaded onto the device. The shape arithmetic below illustrates the divergence.
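
As a quick illustration (shape arithmetic only, not library code; the concrete values of `n`, `k`, and `inner_k_tiles` are arbitrary examples), here is how the pre-PR packed shapes diverged between CUDA and CPU for the same logical `[n][k]` int4 weight:

```python
# Pre-PR packed-weight shapes for the same logical [n][k] int4 weight.
# n, k, inner_k_tiles are arbitrary example values.
n, k, inner_k_tiles = 256, 512, 8

# CUDA-specific layout: [n/8][k/(InnerKTiles*16)][32][InnerKTiles/2], dtype int32
cuda_packed_shape = (n // 8, k // (inner_k_tiles * 16), 32, inner_k_tiles // 2)

# CPU layout: viewed as the same shape, but stored as [n/64][k][32], dtype uint8
cpu_packed_shape = (n // 64, k, 32)

print(cuda_packed_shape)  # (32, 4, 32, 4)
print(cpu_packed_shape)   # (4, 512, 32)
```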

Now, we use a common serialized layout (`[n][k/2] uint8`) across devices and ISAs as the input `weight` of `_convert_weight_to_int4pack`, and each backend chooses how to interpret it as its compute layout.
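A minimal sketch of what the common serialized layout looks like, assuming a row-wise packing of two adjacent int4 values per byte; the nibble order below is an illustrative convention, not necessarily the one PyTorch uses:

```python
import torch

def pack_int4_rowwise(w_q: torch.Tensor) -> torch.Tensor:
    """Pack an [n, k] tensor of int4 values (0..15) into [n, k // 2] uint8.

    Illustrative packing only: adjacent columns are paired, with the first
    value in the high nibble and the second in the low nibble.
    """
    assert w_q.shape[-1] % 2 == 0
    w = w_q.to(torch.uint8)
    return (w[:, ::2] << 4) | (w[:, 1::2] & 0x0F)

n, k = 128, 256
w_q = torch.randint(0, 16, (n, k), dtype=torch.int32)  # already-quantized int4 values
w_uint8 = pack_int4_rowwise(w_q)                        # common layout: [n, k // 2] uint8

# Each backend then converts the common layout into its own compute layout,
# e.g. (subject to the build's device and shape constraints):
# w_int4pack = torch._convert_weight_to_int4pack(w_uint8, 2)
```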

### Performance
Intel(R) Xeon(R) CPU Max 9480, single socket (56 cores)
There is no obvious regression from this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima
diff --git a/test/test_mps.py b/test/test_mps.py
index 255d870..d5918ff 100644
--- a/test/test_mps.py
+++ b/test/test_mps.py
@@ -9181,8 +9181,10 @@
         def convert_weight_to_int4pack(b):
             b_int32, b_scales_and_zeros = _group_quantize_tensor(
-                b, n_bit=4, q_group_size=q_group
+                b.to("cpu"), n_bit=4, q_group_size=q_group
             )
+            b_int32 = b_int32.to("mps")
+            b_scales_and_zeros = b_scales_and_zeros.to("mps")
             b_int4pack = torch._convert_weight_to_int4pack(
                 b_int32, inner_k_tiles
             )