Fix group offloading for quanto-quantized models and the use_stream path for quantized tensor subclasses by Sunt-ing · Pull Request #14038 · huggingface/diffusers

Sunt-ing · 2026-06-22T14:21:21Z

What does this PR do?

Group offloading moves a group's parameters between CPU and the accelerator by reassigning param.data:

param.data = source_tensor.to(device)

This is correct for plain tensors but wrong for tensor subclasses (quantized weights), whose real payload lives in internal sub-tensors (quanto WeightQBytesTensor: _data/_scale; torchao AffineQuantizedTensor: qdata/scale/...). Reassigning .data only swaps the outer wrapper and leaves the inner tensors on the source device, so the next matmul fails with RuntimeError: mat2 is on cpu, different from cuda:0.
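The failure mode can be illustrated without torch. The following is a toy model only: FakeTensor, FakeQuantizedWeight, and both move_* functions are hypothetical stand-ins, not diffusers, quanto, or torchao code. It assumes only that a subclass lists its inner tensors through `__tensor_flatten__()`, as described above.

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a plain tensor: a payload plus a device label."""
    values: list
    device: str = "cpu"

    def to(self, device: str) -> "FakeTensor":
        # Return a copy that lives on the requested device.
        return FakeTensor(list(self.values), device)


class FakeQuantizedWeight:
    """Hypothetical stand-in for a quantized subclass with inner tensors."""

    def __init__(self, data: FakeTensor, scale: FakeTensor):
        self._data = data
        self._scale = scale
        self.device = data.device  # what the outer wrapper reports

    def __tensor_flatten__(self):
        # Same shape as the subclass protocol: (inner attribute names, context).
        return ["_data", "_scale"], None

    def inner_devices(self) -> set:
        names, _ = self.__tensor_flatten__()
        return {getattr(self, name).device for name in names}


def move_wrapper_only(weight, device):
    # Toy analogue of reassigning .data: only the outer view changes.
    weight.device = device


def move_with_inner_tensors(weight, device):
    # Move every inner tensor named by __tensor_flatten__, then the wrapper.
    names, _ = weight.__tensor_flatten__()
    for name in names:
        setattr(weight, name, getattr(weight, name).to(device))
    weight.device = device


w = FakeQuantizedWeight(FakeTensor([1, 2]), FakeTensor([0.5]))
move_wrapper_only(w, "cuda:0")
wrapper_only_result = (w.device, w.inner_devices())
move_with_inner_tensors(w, "cuda:0")
full_move_result = (w.device, w.inner_devices())
```

After the wrapper-only move the wrapper claims to be on cuda:0 while its inner tensors are still on cpu, which is the state that produces the device mismatch on the next matmul.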

#13276 fixed this for torchao by swapping the whole subclass via torch.utils.swap_tensors and restoring inner attributes one by one. Two gaps remained:

- quanto was never handled (#12610, "Quanto + Group Offload causes device mismatch error (weights on cpu, mat1 on gpu)"). Any quanto-quantized model with enable_group_offload hits the wrapper-only .data = path and crashes with a device mismatch on the first forward, for both leaf_level and block_level.
- The streamed path was still broken for both subclasses (#13281, "Group offloading with use_stream=True breaks torchao quantized models (device mismatch) in qwen image"). When use_stream=True, _to_cpu / _pinned_memory_tensors call pin_memory() / is_pinned(), which neither subclass supports: quanto silently loses the subclass identity, and torchao raises NotImplementedError: ... aten.is_pinned. So torchao + use_stream=True crashes even though its non-stream path was already fixed.

Changes (`src/diffusers/hooks/group_offloading.py`)

- Add _is_quanto_tensor plus quanto helpers, and handle quanto next to the existing torchao branch in _transfer_tensor_to_device (onload), _offload_to_memory (restore / offload), and the record_stream path. Inner tensor names come from the standard subclass protocol __tensor_flatten__(); quanto onload uses torch.utils.swap_tensors instead of .data =.
- In _to_cpu and _pinned_memory_tensors, skip pin_memory() / is_pinned() for quanto and torchao subclasses.
- Plain tensors and the torchao non-stream path are untouched (zero behavior change).
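The pin-skipping rule in the second item can be sketched as follows. This is not the body of _to_cpu or _pinned_memory_tensors, which is not shown here: to_cpu_for_offload, is_quantized_subclass, the predicates list, and both stub classes are hypothetical names used for illustration. The sketch relies only on the `cpu()` and `pin_memory()` tensor methods, so it is duck-typed and runs against stubs.

```python
def is_quantized_subclass(tensor, predicates):
    """True if any detector (for example a quanto or torchao check) matches."""
    return any(predicate(tensor) for predicate in predicates)


def to_cpu_for_offload(tensor, predicates, pin=True):
    """Move a tensor to CPU and pin it only when pinning is known to be safe."""
    cpu_tensor = tensor.cpu()
    # Quantized subclasses are left unpinned instead of calling pin_memory().
    if pin and not is_quantized_subclass(cpu_tensor, predicates):
        cpu_tensor = cpu_tensor.pin_memory()
    return cpu_tensor


class PlainStub:
    """Hypothetical stand-in for a plain tensor that supports pinning."""

    def __init__(self):
        self.pinned = False

    def cpu(self):
        return self

    def pin_memory(self):
        self.pinned = True
        return self


class QuantizedStub(PlainStub):
    """Hypothetical stand-in for a subclass where pinning is unsupported."""

    def pin_memory(self):
        raise NotImplementedError("pinning is not implemented for this stub")


predicates = [lambda t: isinstance(t, QuantizedStub)]
plain = to_cpu_for_offload(PlainStub(), predicates)
quantized = to_cpu_for_offload(QuantizedStub(), predicates)
```

Keeping the detectors in a list is a design choice of this sketch: it lets one code path serve several quantization backends without special-casing each at the call site.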

Tests

Added test_group_offloading to the quanto and torchao quantization suites. Each loads a quantized tiny Flux transformer, offloads it across leaf_level / block_level and non-stream / use_stream, and asserts the output matches the non-offloaded quantized baseline.

- tests/quantization/quanto/test_quanto.py (int8 and float8): both fail on main with the device mismatch, pass here.
- tests/quantization/torchao/test_torchao.py::TorchAoTest::test_group_offloading: the use_stream=True cases fail on main with the aten.is_pinned error, pass here.

Reproduction and before/after

Environment: NVIDIA RTX 4090, torch==2.8.0+cu128, diffusers @ 2d0110f, optimum-quanto==0.2.7, torchao==0.17.0.
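To compare a local setup against the environment above, the installed versions can be printed with the standard library alone. This is a convenience sketch, not part of the PR; installed_versions is a hypothetical helper name, and the package names are the PyPI distribution names.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names of the packages listed in the environment line.
PACKAGES = ("torch", "diffusers", "optimum-quanto", "torchao")


def installed_versions(packages=PACKAGES) -> dict:
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name}=={ver}" if ver else f"{name}: not installed")
```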

Minimal standalone repro for #12610 (quanto):

import torch
from diffusers import UNet2DConditionModel
from diffusers.hooks import apply_group_offloading
from optimum.quanto import quantize, freeze, qint8

# Load a tiny UNet in float32 and switch to eval mode.
m = UNet2DConditionModel.from_pretrained(
    "hf-internal-testing/tiny-stable-diffusion-pipe", subfolder="unet"
).to(torch.float32).eval()
# Quantize the weights to int8 with quanto, then freeze to materialize them.
quantize(m, weights=qint8)
freeze(m)
apply_group_offloading(
    m, onload_device=torch.device("cuda"), offload_device=torch.device("cpu"),
    offload_type="leaf_level",
)
# Dummy inputs: latents, timesteps, and encoder hidden states.
x = torch.randn(2, m.config.in_channels, m.config.sample_size, m.config.sample_size, device="cuda")
t = torch.tensor([10, 10], device="cuda")
e = torch.randn(2, 4, m.config.cross_attention_dim, device="cuda")
with torch.no_grad():
    m(x, t, e)  # main: RuntimeError: mat2 is on cpu, different from cuda:0

Running the new tests (RUN_NIGHTLY=1 RUN_SLOW=1):

# on main (fix reverted, tests kept)
quanto  FluxTransformerInt8WeightsTest::test_group_offloading    FAILED  (mat2 is on cpu, different from cuda:0)
quanto  FluxTransformerFloat8WeightsTest::test_group_offloading  FAILED  (mat2 is on cpu, different from cuda:0)
torchao TorchAoTest::test_group_offloading                       FAILED  (NotImplementedError: ... aten.is_pinned)

# with this PR
quanto  FluxTransformerInt8WeightsTest::test_group_offloading    PASSED
quanto  FluxTransformerFloat8WeightsTest::test_group_offloading  PASSED
torchao TorchAoTest::test_group_offloading                       PASSED

Across leaf_level / block_level × non-stream / use_stream / record_stream, the offloaded output is bit-identical (max abs diff = 0.0) to the fully-on-accelerator quantized baseline. A non-quantized group-offload equivalence sweep stays at 0.0 (plain-tensor path unchanged).
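The sweep above covers two offload types and three stream modes. A minimal sketch of enumerating that matrix as keyword arguments for apply_group_offloading follows. The helper name offload_configs and the choice of num_blocks_per_group=1 are assumptions, not taken from the PR's tests, and the onload_device / offload_device arguments are left out for brevity.

```python
from itertools import product

OFFLOAD_TYPES = ("leaf_level", "block_level")
STREAM_MODES = {
    "non-stream": {"use_stream": False},
    "use_stream": {"use_stream": True},
    # record_stream is only meaningful together with use_stream=True.
    "record_stream": {"use_stream": True, "record_stream": True},
}


def offload_configs():
    """Return (label, kwargs) pairs for every offload type and stream mode."""
    configs = []
    for offload_type, (mode, stream_kwargs) in product(OFFLOAD_TYPES, STREAM_MODES.items()):
        kwargs = {"offload_type": offload_type, **stream_kwargs}
        if offload_type == "block_level":
            # Assumed group size; block_level needs num_blocks_per_group set.
            kwargs["num_blocks_per_group"] = 1
        configs.append((f"{offload_type}/{mode}", kwargs))
    return configs


for label, kwargs in offload_configs():
    print(label, kwargs)
```

Each kwargs dict would be combined with the device arguments and passed to apply_group_offloading before comparing the output against the non-offloaded baseline.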

Relationship to other work

- The streamed-pin half of #13281 ("Group offloading with use_stream=True breaks torchao quantized models (device mismatch) in qwen image") was also pursued upstream in torchao (pytorch/ao#4158, "Feature Request: Support Async-Stream Transfer for [AffineQuantizedTensor] (Fix diffusers group_offload device mismatch)"; pytorch/ao#4192, "support pinning for mx and nvfp4 tensors"), but that only added pinning for mx / nvfp4 tensors. Int8WeightOnlyConfig AffineQuantizedTensor still raises aten.is_pinned on torchao==0.17.0, so the streamed path is still broken for the common int8 case. Skipping pinning on the diffusers side fixes it regardless of the torchao version, and is also required for quanto, whose subclass tensors do not implement torch pinning at all.
- #13875 ("Add TorchAO disk group offload support") and #13721 ("torchao: safetensors save/load + disk group offload (closes #13713)") (open) refactor the same torchao offload helpers (_to_cpu, _pinned_memory_tensors, _swap_torchao_tensor) to add disk offload. They are orthogonal in intent (disk offload versus the in-memory device-mismatch / stream-pin crash fixed here) but touch the same region, so this PR will need a rebase around whichever lands first.

Who can review?

cc @sayakpaul

Before submitting

- Did you use an AI agent (Claude Code, Codex, Cursor, etc.) to help with this PR? If so:
  - Did you read the Coding with AI agents guide?
  - Did you self-review the diff against .ai/review-rules.md?
- Did you read the contributor guideline?
- Did you write any new necessary tests?


sayakpaul · 2026-06-23T20:36:29Z

Group offloading should have been fixed with #13276, though. Can you check again?

Sunt-ing · 2026-06-25T17:20:38Z

Hi @sayakpaul, thanks. Yes, I rechecked against #13276 before opening this. #13276 makes group offloading work for torchao by swapping the subclass (_is_torchao_tensor → torch.utils.swap_tensors on onload, setattr of inner tensors on the offload restore). Two cases it doesn't cover are exactly what this PR targets:

- quanto was never handled (#12610, "Quanto + Group Offload causes device mismatch error (weights on cpu, mat1 on gpu)"). There is no quanto branch anywhere in group_offloading.py (still true on main today), so a quanto-quantized model with enable_group_offload falls through to the plain param.data = source.to(device) path. That swaps only the outer wrapper and leaves _data / _scale on the offload device, so the first forward crashes with mat2 is on cpu, different from cuda:0, for both leaf_level and block_level.
- The streamed path is still broken for both subclasses (#13281, "Group offloading with use_stream=True breaks torchao quantized models (device mismatch) in qwen image"). As #13276 ("[core] fix group offloading when using torchao") itself notes, its stream handling assumes the subclass implements pinning ("not something we can always guarantee ... we need coordination with TorchAO"; see pytorch/ao#4192, "support pinning for mx and nvfp4 tensors"). That coordination only added pinning for mx / nvfp4; Int8WeightOnlyConfig's AffineQuantizedTensor still raises NotImplementedError: ... aten.is_pinned on torchao==0.17.0, and quanto implements no torch pinning at all. _to_cpu / _pinned_memory_tensors still call pin_memory() / is_pinned() unconditionally, so torchao + use_stream=True crashes even with #13276 in. This PR skips pinning for both subclasses on the diffusers side, so it works regardless of the torchao version.

Both #12610 and #13281 are still open. I confirmed on current main (so with #13276 in) that the three tests this PR adds fail, and pass here:

main (with #13276) vs this PR

# main (fix reverted, tests kept)
quanto  FluxTransformerInt8WeightsTest::test_group_offloading    FAILED  (mat2 is on cpu, different from cuda:0)
quanto  FluxTransformerFloat8WeightsTest::test_group_offloading  FAILED  (mat2 is on cpu, different from cuda:0)
torchao TorchAoTest::test_group_offloading                       FAILED  (NotImplementedError: ... aten.is_pinned)

# with this PR
quanto  FluxTransformerInt8WeightsTest::test_group_offloading    PASSED
quanto  FluxTransformerFloat8WeightsTest::test_group_offloading  PASSED
torchao TorchAoTest::test_group_offloading                       PASSED

On approach: I deliberately mirrored the existing _is_torchao_tensor branch rather than touching it, to keep this a low-risk bug fix (_is_quanto_tensor gates on is_optimum_quanto_available() and pulls inner-tensor names from the standard __tensor_flatten__()). I also saw your note in #13276 about generalizing these utilities to swap_tensors for any subclass instead of .data. Happy to fold torchao + quanto into one generic subclass path here, or leave that as the separate follow-up you mentioned, whichever you prefer.
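As a rough illustration of what a generic subclass path could inspect, the sketch below walks `__tensor_flatten__()` recursively and yields every leaf tensor with its attribute path. It is a hypothetical helper, not code from this PR or from #13276; iter_inner_tensors, Leaf, and Wrapper are illustrative names, and the stand-in classes exist only so the sketch runs without torch.

```python
def iter_inner_tensors(tensor, prefix=""):
    """Yield (attribute path, leaf) pairs, descending through nested subclasses."""
    flatten = getattr(tensor, "__tensor_flatten__", None)
    if flatten is None:
        # No flatten protocol: treat the object as a plain leaf tensor.
        yield prefix or "<self>", tensor
        return
    names, _context = flatten()
    for name in names:
        path = f"{prefix}.{name}" if prefix else name
        yield from iter_inner_tensors(getattr(tensor, name), path)


class Leaf:
    """Hypothetical stand-in for a plain tensor."""

    def __init__(self, device):
        self.device = device


class Wrapper:
    """Hypothetical stand-in for a subclass that wraps named inner tensors."""

    def __init__(self, **inner):
        self._names = list(inner)
        for name, value in inner.items():
            setattr(self, name, value)

    def __tensor_flatten__(self):
        return self._names, None


nested = Wrapper(qdata=Wrapper(_data=Leaf("cpu"), _scale=Leaf("cpu")), zero_point=Leaf("cpu"))
inner_paths = [path for path, _leaf in iter_inner_tensors(nested)]
```

A generic offload path could then move each yielded leaf and rebuild or swap the outer tensor, instead of keeping one branch per quantization backend.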

Commit 8ab88ee: Fix group offloading for quanto-quantized models and the use_stream path for quantized tensor subclasses

github-actions bot added the fixes-issue, size/M (PR with diff < 200 LOC), tests, and hooks labels and removed the size/M (PR with diff < 200 LOC) label · Jun 22, 2026

Sunt-ing wants to merge 1 commit into huggingface:main from Sunt-ing:0
