🧪 Gemma 4B — Distilled from Gemma 3

This repository contains a 4B-parameter distilled Gemma model, trained with knowledge distillation from the larger Gemma-3-12B-Instruct teacher.
The objective was to produce a smaller, faster model that preserves strong instruction-following behavior while remaining practical for edge deployment and low-latency inference.


🔥 Overview

  • Teacher Model: google/gemma-3-12b-it
  • Student Model: vishalkhot/gemma-4b-distilled (initialized from google/gemma-3-4b-it)
  • Goal: Achieve near-teacher quality in a ~4B-parameter footprint
  • Method: Logits-based distillation with temperature scaling, plus supervised fine-tuning on curated instruction data

This model is ideal for:

  • Chat-based assistants
  • Reasoning over short/medium contexts
  • On-device inference (4B fits easily on a single modern GPU; see the usage sketch below)
  • Serverless or low-cost API deployments
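
The following is a minimal inference sketch using the Hugging Face transformers library. It assumes the model is published as vishalkhot/gemma-4b-distilled with a standard chat template; the prompt and generation settings are purely illustrative.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vishalkhot/gemma-4b-distilled"  # repo name from this card

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # training used --bf16, so load in BF16
    device_map="auto",
)

# Single-turn chat prompt built through the tokenizer's chat template (illustrative)
messages = [{"role": "user", "content": "Explain knowledge distillation in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))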

🧬 Distillation Process

Distillation was performed over a mixed instruction dataset, combining reasoning, multi-turn dialogue, tool-usage instructions, and diverse language tasks.
Training used a combination of the following (a simplified loss sketch follows the list):

  • 🔹 KL divergence on teacher/student logits
  • 🔹 Cross-entropy loss on reference outputs
  • 🔹 Temperature scaling (T = 2–4)
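
The sketch below shows how these terms are typically combined into a single distillation objective. It is a simplified illustration rather than the exact implementation in distill.py; the temperature and alpha names mirror the command-line flags shown next.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.5, alpha=0.5):
    """Blend soft-target KL divergence with hard-target cross-entropy.

    Simplified illustration: alpha weights the KL term against the
    cross-entropy term, and temperature softens both distributions
    before they are compared.
    """
    # KL divergence between temperature-softened teacher and student distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Ordinary cross-entropy against the reference (hard) targets.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,  # skip masked prompt/padding positions
    )

    return alpha * kl + (1.0 - alpha) * ce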

Training Command Example

python distill.py \
  --teacher-model google/gemma-3-12b-it \
  --student-model google/gemma-3-4b-it \
  --train-file data/train.jsonl \
  --val-file data/val.jsonl \
  --output-dir gemma_4b_distilled \
  --batch-size 8 \
  --gradient-accumulation-steps 2 \
  --num-epochs 1 \
  --learning-rate 1e-5 \
  --warmup-steps 300 \
  --max-length 2048 \
  --temperature 2.5 \
  --alpha 0.5 \
  --save-steps 1000 \
  --eval-steps 200 \
  --logging-steps 10 \
  --bf16