Fine-tuned Gemma-3-1b model produces gibberish/empty output after quantization (GPTQ/AWQ/BitsAndBytes all fail)
Environment:
Model: google/gemma-3-1b-pt fine-tuned with Unsloth LoRA (r=8), trained with the ChatML format (since this is a pretrained model, not an instruction-tuned one)
Full-precision model: works perfectly and produces the expected responses
Hardware: L40S 48GB VRAM
Issue:
After fine-tuning with Unsloth LoRA and merging weights, all quantization methods fail while the full precision model works perfectly.
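For reference, the merge was done roughly like this with Unsloth (the checkpoint paths are placeholders from my setup and the exact arguments are from memory, so treat this as approximate):

```python
from unsloth import FastLanguageModel

# Load the LoRA fine-tuned checkpoint in full precision for the merge
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="outputs/gemma-3-1b-chatml-lora",  # placeholder path to the LoRA checkpoint
    max_seq_length=2048,
    load_in_4bit=False,
)

# Merge the LoRA adapters into the base weights and save a 16-bit model
model.save_pretrained_merged(
    "outputs/gemma-3-1b-merged",  # placeholder output directory
    tokenizer,
    save_method="merged_16bit",
)
```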
Quantization Results:
AWQ (W4A16, W8A16): produces repetitive gibberish that loops endlessly
GPTQ (W4A16, W8A8): outputs all zeros immediately with no actual computation (returns in 20-30 sec vs ~1 min for the full-precision model)
BitsAndBytes (4-bit, 8-bit): gibberish output with repetition loops for 8-bit and blank output for 4-bit (see the load sketch after this list)
All methods tried with/without ignore=["lm_head"]
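For BitsAndBytes specifically, the quantized load looked roughly like this (the merged-model path is a placeholder; the 8-bit run used load_in_8bit=True instead of the 4-bit config):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

merged_dir = "outputs/gemma-3-1b-merged"  # placeholder path to the merged model

# 4-bit NF4 config; the 8-bit run swapped this for load_in_8bit=True
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(merged_dir)
model = AutoModelForCausalLM.from_pretrained(
    merged_dir,
    quantization_config=bnb_config,
    device_map="auto",
)
```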
Debugging Done:
Tested different generation parameters (temperature, repetition_penalty, sampling)
Tried various prompt formats (ChatML, simple text)
Verified the model dtype still shows torch.float16 even after "quantization", suggesting the quantization silently failed (see the check after this list)
Full precision model generates proper responses in ~1 minute
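The dtype check and generation tests from the list above were roughly the following, continuing from the BitsAndBytes sketch (where `model` and `tokenizer` are the quantized model and its tokenizer):

```python
from collections import Counter

# 1. Inspect parameter dtypes: if everything is still torch.float16,
#    the quantization most likely never actually applied.
print(Counter(p.dtype for p in model.parameters()))

# 2. Check whether any layers were replaced by bitsandbytes modules.
quantized = [n for n, m in model.named_modules()
             if type(m).__module__.startswith("bitsandbytes")]
print(f"{len(quantized)} modules come from bitsandbytes")

# 3. Generation test with the parameters I varied
#    (temperature, repetition_penalty, sampling on/off).
prompt = "<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.7,
    repetition_penalty=1.2,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```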
Are there quantization parameters specifically recommended for LoRA-merged models, or should quantization-aware training be used instead of post-training quantization for fine-tuned models?
Any guidance on successful quantization of fine-tuned Gemma models would be appreciated.
Thanks!
Great that exporting it to 16-bit works. Because the model is so small, it's normal for accuracy to drop drastically when you quantize it.
I would recommend keeping it in full 16-bit or using our new QAT support!
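If you stay at 16-bit, a minimal way to run the merged model would be something like this (the path is a placeholder; bfloat16 is assumed since Gemma was trained in bfloat16):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

merged_dir = "outputs/gemma-3-1b-merged"  # placeholder path to your merged 16-bit model

tokenizer = AutoTokenizer.from_pretrained(merged_dir)
model = AutoModelForCausalLM.from_pretrained(
    merged_dir,
    torch_dtype=torch.bfloat16,  # keep full 16-bit precision
    device_map="auto",
)

prompt = "<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```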
From Reddit