Fine-tuned Gemma-3-1b model produces gibberish/empty output after quantization (GPTQ/AWQ/BitsAndBytes all fail)
Environment:
Model: google/gemma-3-1b-pt fine-tuned with Unsloth LoRA (r=8), trained with the ChatML format (since this is a pretrained model, not an instruction-tuned one)
Full-precision model: works perfectly and produces the expected responses
Hardware: L40S 48GB VRAM
Issue:
After fine-tuning with Unsloth LoRA and merging weights, all quantization methods fail while the full precision model works perfectly.
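For reference, the merge was done roughly like this with Unsloth (the checkpoint paths are placeholders from my setup and the exact arguments are from memory, so treat this as approximate):

```python
from unsloth import FastLanguageModel

# Load the LoRA fine-tuned checkpoint in full precision for the merge
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="outputs/gemma-3-1b-chatml-lora",  # placeholder path to the LoRA checkpoint
    max_seq_length=2048,
    load_in_4bit=False,
)

# Merge the LoRA adapters into the base weights and save a 16-bit model
model.save_pretrained_merged(
    "outputs/gemma-3-1b-merged",  # placeholder output directory
    tokenizer,
    save_method="merged_16bit",
)
```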
Quantization Results:
AWQ (W4A16, W8A16): produces repetitive gibberish that loops endlessly
GPTQ (W4A16, W8A8): outputs all zeros immediately with no actual computation (returns in 20-30 sec vs ~1 min for the full-precision model)
BitsAndBytes (4-bit, 8-bit): gibberish output with repetition loops for 8-bit and blank output for 4-bit (see the load sketch after this list)
All methods tried with/without ignore=["lm_head"]
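For BitsAndBytes specifically, the quantized load looked roughly like this (the merged-model path is a placeholder; the 8-bit run used load_in_8bit=True instead of the 4-bit config):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

merged_dir = "outputs/gemma-3-1b-merged"  # placeholder path to the merged model

# 4-bit NF4 config; the 8-bit run swapped this for load_in_8bit=True
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(merged_dir)
model = AutoModelForCausalLM.from_pretrained(
    merged_dir,
    quantization_config=bnb_config,
    device_map="auto",
)
```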
Debugging Done:
Tested different generation parameters (temperature, repetition_penalty, sampling)
Tried various prompt formats (ChatML, simple text)
Verified the model dtype still shows torch.float16 even after "quantization", suggesting the quantization silently failed (see the check after this list)
Full precision model generates proper responses in ~1 minute
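The dtype check and generation tests from the list above were roughly the following, continuing from the BitsAndBytes sketch (where `model` and `tokenizer` are the quantized model and its tokenizer):

```python
from collections import Counter

# 1. Inspect parameter dtypes: if everything is still torch.float16,
#    the quantization most likely never actually applied.
print(Counter(p.dtype for p in model.parameters()))

# 2. Check whether any layers were replaced by bitsandbytes modules.
quantized = [n for n, m in model.named_modules()
             if type(m).__module__.startswith("bitsandbytes")]
print(f"{len(quantized)} modules come from bitsandbytes")

# 3. Generation test with the parameters I varied
#    (temperature, repetition_penalty, sampling on/off).
prompt = "<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.7,
    repetition_penalty=1.2,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```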
Are there quantization parameters specifically recommended for LoRA-merged models, or should quantization-aware training be used instead of post-training quantization for fine-tuned models?
Any guidance on successful quantization of fine-tuned Gemma models would be appreciated.
Thanks!
Great that exporting it to 16-bit works. Because the model is so small, it's normal for accuracy to drop drastically when you quantize it.
I would recommend keeping it in full 16-bit or using our new QAT support!
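If you stay at 16-bit, a minimal way to run the merged model would be something like this (the path is a placeholder; bfloat16 is assumed since Gemma was trained in bfloat16):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

merged_dir = "outputs/gemma-3-1b-merged"  # placeholder path to your merged 16-bit model

tokenizer = AutoTokenizer.from_pretrained(merged_dir)
model = AutoModelForCausalLM.from_pretrained(
    merged_dir,
    torch_dtype=torch.bfloat16,  # keep full 16-bit precision
    device_map="auto",
)

prompt = "<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```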
From Reddit