---
license: apache-2.0
base_model: google/gemma-4-12B-it
base_model_relation: quantized
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- gemma4
- fp8
- modelopt
- tensorrt
- sglang
- quantized
- hopper
- blackwell
---

# AxionML Gemma-4-12B-FP8

Developed by AxionML for open-source serving and deployment use cases. Part of AxionML's effort to provide ready-to-serve quantized models for the community.

This is an **FP8-quantized** version of [google/gemma-4-12B-it](https://huggingface.co/google/gemma-4-12B-it) (11.95B params), using **per-block 128×128 weight-only FP8 (E4M3)** with an FP8 KV-cache, MSE-calibrated. Activations are kept in BF16. This is deliberate: gemma-4's attention activations carry large per-channel outliers, so quantizing activations (per-tensor W8A8) measurably degrades the model. Quantizing only the weights is lossless on GSM8K (0.9666 vs 0.9636 BF16) while halving the weight footprint (~24 GB → ~13 GB). Serves on Hopper (H100/H200) and Blackwell.

## Quantization Details

This model was quantized by applying **per-block (128×128) FP8 (E4M3) to the weights** of the linear operators within the transformer blocks. **Activations are kept in BF16** (weight-only). The **KV-cache is quantized to FP8 (E4M3)**. `lm_head` and the multimodal (vision/audio) embedders are kept in their original BF16 precision.

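
As a rough illustration of the per-block, weight-only scheme described above, the sketch below fake-quantizes a toy matrix in pure Python. It is not ModelOpt's implementation: the E4M3 grid is simulated, every helper name (`round_to_e4m3`, `quantize_blocks`, `mse_scale`, ...) is hypothetical, 4×4 tiles stand in for the real 128×128 blocks, and the MSE sweep range (0.5 to 1.0 of the max-based scale) is an assumption chosen for illustration.

```python
import math

E4M3_MAX = 448.0  # largest finite E4M3 magnitude


def round_to_e4m3(x):
    """Round x to the nearest point of a simulated E4M3 grid, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)
    # frexp returns mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) == e - 1.
    exponent = max(math.frexp(mag)[1] - 1, -6)  # exponents below -6 are subnormal
    step = 2.0 ** (exponent - 3)                # 3 mantissa bits: 8 steps per binade
    return math.copysign(min(round(mag / step) * step, E4M3_MAX), x)


def fake_quant(values, scale):
    """Quantize then dequantize a list of weights that share one scale."""
    return [round_to_e4m3(v / scale) * scale for v in values]


def mse(values, scale):
    """Mean squared reconstruction error of fake_quant at the given scale."""
    return sum((v - q) ** 2 for v, q in zip(values, fake_quant(values, scale))) / len(values)


def max_scale(values):
    """'max' calibration: the largest magnitude maps onto 448."""
    amax = max(abs(v) for v in values)
    return amax / E4M3_MAX if amax > 0 else 1.0


def mse_scale(values, steps=16):
    """MSE calibration sketch: sweep fractions of the max scale, keep the lowest error."""
    base = max_scale(values)
    candidates = [base * (0.5 + 0.5 * i / steps) for i in range(steps + 1)]
    return min(candidates, key=lambda s: mse(values, s))


def quantize_blocks(weight, block, calibrate=max_scale):
    """Fake-quantize a 2D list with one scale per (block x block) tile."""
    rows, cols = len(weight), len(weight[0])
    out = [row[:] for row in weight]
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            cells = [(r, c) for r in range(r0, min(r0 + block, rows))
                     for c in range(c0, min(c0 + block, cols))]
            tile = [weight[r][c] for r, c in cells]
            for (r, c), q in zip(cells, fake_quant(tile, calibrate(tile))):
                out[r][c] = q
    return out


def region_mse(a, b, lo, hi):
    """Mean squared error over the square region [lo, hi) x [lo, hi)."""
    cells = [(r, c) for r in range(lo, hi) for c in range(lo, hi)]
    return sum((a[r][c] - b[r][c]) ** 2 for r, c in cells) / len(cells)


# Toy 8x8 weight: the top-left 4x4 tile is about 1e6 times larger than the rest.
W = [[math.sin(1.0 + 8 * r + c) * (1.0 if r < 4 and c < 4 else 1e-6)
      for c in range(8)] for r in range(8)]

per_tensor = quantize_blocks(W, block=8)                        # single scale
per_block = quantize_blocks(W, block=4)                         # one scale per tile
per_block_mse = quantize_blocks(W, block=4, calibrate=mse_scale)

print("small-tile MSE, per-tensor:", region_mse(W, per_tensor, 4, 8))
print("small-tile MSE, per-block :", region_mse(W, per_block, 4, 8))
print("full MSE, per-block max   :", region_mse(W, per_block, 0, 8))
print("full MSE, per-block MSE   :", region_mse(W, per_block_mse, 0, 8))
```

With a single scale, the small-magnitude tile falls below the smallest representable step and is flushed to zero, while per-tile scales keep its values. Because the sweep includes the max-based scale as a candidate, the MSE-calibrated result in this sketch cannot have a higher reconstruction error than the max-calibrated one.
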

| | |
|---|---|
| Quantization format | FP8 `fp8_pb_wo`: per-block 128×128 weight-only E4M3, MSE weight calibration |
| Activations | BF16 (not quantized) |
| KV-cache | FP8 (E4M3) |
| Calibration dataset | `cnn_dailymail` + `nvidia/Nemotron-Post-Training-Dataset-v2` (ModelOpt `cnn_nemotron_v2_mix` default, 2048 samples) |
| Quantized checkpoint size | ~13 GB (vs ~24 GB BF16) |
| Tool | [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) (`0.45.0.dev158+gf9423c0d3`, built from source) |
| Target hardware | Hopper (H100/H200, sm_90) and Blackwell (sm_100/103/120) |

## Usage

### Deploy with SGLang

Requires the SGLang branch in [SGLang support](#sglang-support) below (`fp8_pb_wo` block-FP8 support + Blackwell UE8M0 scale requant + transformers≥5.10 multimodal name handling for Gemma-4).

```bash
sglang serve --model-path AxionML/Gemma-4-12B-FP8 \
  --quantization modelopt_fp8 \
  --kv-cache-dtype fp8_e4m3 \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --mem-fraction-static 0.85 \
  --host 0.0.0.0 --port 30000
```

#### Speculative decoding (MTP / NEXTN)

Multi-Token Prediction with the paired `google/gemma-4-12B-it-assistant` draft works on this quantized target with the SGLang branch below. Use the **Triton** attention backend and load the draft **unquantized**:

```bash
sglang serve --model-path AxionML/Gemma-4-12B-FP8 \
  --quantization modelopt_fp8 \
  --kv-cache-dtype fp8_e4m3 \
  --attention-backend triton \
  --speculative-algorithm NEXTN \
  --speculative-draft-model-path google/gemma-4-12B-it-assistant \
  --speculative-draft-model-quantization unquant \
  --speculative-num-steps 5 --speculative-num-draft-tokens 6 --speculative-eagle-topk 1 \
  --reasoning-parser gemma4 --tool-call-parser gemma4 \
  --mem-fraction-static 0.85 --host 0.0.0.0 --port 30000
```

MTP is lossless on GSM8K (see [Accuracy](#accuracy)).
Earlier SGLang versions mis-loaded ModelOpt's attention-projection scales (`self_attn.{k,v}_proj.{k,v}_scale`) as the RadixAttention KV-cache scales, which corrupted the spec-decode verify forward on quantized targets (degenerate output) while BF16 targets were fine. The branch fix leaves gemma-4's KV scales at their identity default (1.0). This is correct because gemma-4 writes K/V to the cache *after* q/k-norm and RoPE, so the projection-output scales are the wrong descale factor. (The related trtllm_mha SWA-pool crash, [sgl-project/sglang#26957](https://github.com/sgl-project/sglang/issues/26957), is already fixed on main.)

Sampling defaults for Gemma 4: `temperature=1.0, top_p=0.95, top_k=64`. Thinking mode is off by default; enable it with `extra_body={"chat_template_kwargs": {"enable_thinking": True}}`.

Smoke test:

```bash
curl http://localhost:30000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "default",
  "messages": [{"role": "user", "content": "What is C. elegans?"}],
  "temperature": 1.0, "top_p": 0.95, "top_k": 64,
  "max_tokens": 256
}'
```

### Reproduce with ModelOpt

```bash
python examples/llm_ptq/hf_ptq.py \
  --pyt_ckpt_path google/gemma-4-12B-it \
  --qformat fp8_pb_wo \
  --weight_calib_algorithm mse \
  --kv_cache_qformat fp8 \
  --export_path ./gemma-4-12B-it-FP8 \
  --trust_remote_code
```

(`--weight_calib_algorithm mse` is a small local addition to ModelOpt's `hf_ptq.py` that overrides the qformat's weight calibration to MSE; `fp8_pb_wo`'s stock algorithm is `max`.)

## About FP8

FP8 (E4M3: 4 exponent bits, 3 mantissa bits, range ±448) stores each weight in a single byte. `fp8_pb_wo` ("per-block weight-only") quantizes weights in **2D 128×128 blocks**, each with its own FP8 scale, and **does not quantize activations**. At runtime the block-scaled FP8 weights drive a block GEMM (DeepGEMM on Blackwell, with UE8M0 scale requant) against BF16 activations.
Per-block scaling adapts to local weight magnitude far better than a single per-tensor scale, and **MSE calibration** sweeps each block's scale to minimize ‖W − dequant(quant(W))‖² instead of taking max-of-abs. The KV-cache is additionally stored in FP8 (E4M3) to halve KV memory.

**Why weight-only on Gemma-4:** standard per-tensor **W8A8** FP8 (quantized activations, as NVIDIA ships for Llama/Nemotron) degrades sharply on gemma-4: its attention residual stream has persistent per-channel activation outliers that a single per-tensor activation scale crushes, even within FP8's ±448 range. Leaving activations in BF16 (weight-only) sidesteps this entirely and is lossless.

## About NVFP4 (sister checkpoint)

A companion [AxionML/Gemma-4-12B-NVFP4](https://huggingface.co/AxionML/Gemma-4-12B-NVFP4) ships a 4-bit variant following NVIDIA's dense-Gemma-4 recipe ([`nvidia/Gemma-4-31B-IT-NVFP4`](https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4)): **NVFP4 (E2M1 + FP8 16-element micro-block scales) on the MLP/FFN, attention kept BF16**, FP8 KV-cache, MSE-calibrated. NVFP4 requires Blackwell (native FP4 Tensor Cores); serve it with `--quantization modelopt_fp4`.

## Accuracy

GSM8K (1319 questions, `sgl-eval`, greedy, served on SGLang):

| Model | GSM8K |
|---|---|
| `google/gemma-4-12B-it` (BF16) | 0.9636 |
| **AxionML/Gemma-4-12B-FP8** (weight-only, MSE) | **0.9666** |
| **AxionML/Gemma-4-12B-FP8 + MTP** (NEXTN) | **0.9598** |
| AxionML/Gemma-4-12B-NVFP4 (MLP-only, MSE) | 0.9612 |
| AxionML/Gemma-4-12B-NVFP4 + MTP (NEXTN) | 0.9644 |

MTP (greedy, exact verify) is lossless within GSM8K run-to-run noise: accuracy holds **with and without** speculative decoding.

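
To put the GSM8K differences above in perspective, the sketch below estimates the binomial sampling error of an accuracy measured on 1319 questions and expresses each score's gap to BF16 in units of that error. This is only a rough yardstick and an assumption of this example, not how the scores were evaluated: it treats questions as independent trials and ignores that all models answer the same questions.

```python
import math

N_QUESTIONS = 1319  # GSM8K questions used in the table above

# Scores copied from the accuracy table.
scores = {
    "BF16": 0.9636,
    "FP8": 0.9666,
    "FP8 + MTP": 0.9598,
    "NVFP4": 0.9612,
    "NVFP4 + MTP": 0.9644,
}


def standard_error(p, n):
    """Binomial standard error of an accuracy p measured on n questions."""
    return math.sqrt(p * (1.0 - p) / n)


baseline = scores["BF16"]
se = standard_error(baseline, N_QUESTIONS)
print(f"BF16 standard error: {se:.4f} (about {se * N_QUESTIONS:.1f} questions)")

for name, acc in scores.items():
    delta = acc - baseline
    print(f"{name:12s} {acc:.4f}  delta={delta:+.4f}  "
          f"({delta / se:+.2f} SE, {round(delta * N_QUESTIONS):+d} questions)")
```

Under these assumptions one standard error is roughly half a percentage point, and every quantized configuration in the table sits within one standard error of the BF16 score.
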
## Performance (SPEED-Bench)

Latency/throughput measured with [NVIDIA AIPerf](https://github.com/ai-dynamo/aiperf) on the [`nvidia/SPEED-Bench`](https://huggingface.co/datasets/nvidia/SPEED-Bench) **qualitative** split (all 11 domains, 880 prompts each issued once, `shuffle` / seed 42), greedy, output capped at 512 tokens, OpenAI `chat` + streaming, one Blackwell GPU, served on the SGLang branch below. Prompts are short (input sequence length ≈ 145 tokens, output sequence length ≈ 410 tokens). MTP uses the `google/gemma-4-12B-it-assistant` NEXTN draft. In the tables, TTFT is time to first token and ITL is inter-token latency.

**Concurrency 1: single-stream latency** (the low-latency serving regime):

| Config | TTFT (ms) | ITL (ms) | tok/s/user | accept len |
|---|---|---|---|---|
| `gemma-4-12B-it` BF16 | 19.4 | 6.47 | 154.6 | n/a |
| **FP8** | 28.7 | 5.91 | 169.3 | n/a |
| **FP8 + MTP** | 27.5 | 3.09 | **338.2** | 3.50 |

- FP8 vs BF16: **1.10×** single-stream tokens/s (memory-bandwidth-bound, so the 13 GB weight footprint wins; quantization raises TTFT from 19.4 ms to 28.7 ms).
- MTP on FP8: **2.00×** tokens/s, ITL **1.91×** lower (accept length 3.50 of 6 draft tokens).
- **FP8 + MTP vs BF16 baseline: ≈ 2.19×** single-stream tokens/s.

**Concurrency 32: throughput** (saturated / compute-bound):

| Config | agg tok/s | req/s | TTFT (ms) | accept len |
|---|---|---|---|---|
| `gemma-4-12B-it` BF16 | 3250 | 7.8 | 36 | n/a |
| **FP8** | 2930 | 7.6 | 55 | n/a |
| **FP8 + MTP** | 2813 | 7.3 | 71 | 3.23 |

At saturation the GPU is compute-bound, so the `fp8_pb_wo` block-GEMM (with dequant) doesn't beat BF16 dense GEMM on aggregate throughput (0.90×), and MTP is roughly neutral (0.96× relative to FP8 alone).

**Takeaway:** FP8, especially with MTP, pays off most in the **low-concurrency / latency-bound** regime; at saturation, aggregate throughput stays within about 13% across the three configurations, with BF16 slightly ahead.

## SGLang support

Gemma 4 (including the encoder-free unified 12B) is supported on SGLang main.
Serving this **`fp8_pb_wo`** checkpoint additionally needs the branch below, which adds: (1) `fp8_pb_wo` per-block weight-only FP8 to the ModelOpt FP8 path (the stock path is per-tensor only) plus Blackwell UE8M0 scale requant for the DeepGEMM block kernel; (2) remap of the `embed_vision.*` multimodal weight names emitted by a transformers≥5.10 ModelOpt re-export. It also fixes speculative decoding (NEXTN/MTP) on quantized targets: SGLang must **not** load ModelOpt's attention-projection scales (`self_attn.{k,v}_proj.{k,v}_scale`) as the RadixAttention KV-cache `{k,v}_scale`. Gemma-4 caches K/V post-norm/post-RoPE, so those scales are the wrong descale factor and corrupt the spec verify forward; the KV scales correctly default to 1.0.

```bash
# Editable install of the branch
git clone https://github.com/bzhng-development/sglang.git
cd sglang && git checkout gemma4-modelopt-ptq
pip install -e python

# transformers with Gemma 4 (encoder-free unified) support
pip install 'git+https://github.com/huggingface/transformers.git@1423d22f7a3b62e8c70ad67b58ec25cd9b675897'
```

Branch: [`bzhng-development/sglang@gemma4-modelopt-ptq`](https://github.com/bzhng-development/sglang/tree/gemma4-modelopt-ptq) (off `sgl-project/sglang` main).

### Run with Docker (SGLang nightly)

Serving needs the [SGLang branch](#sglang-support), so start from a recent **SGLang nightly** image (`lmsysorg/sglang:nightly-dev-YYYYMMDD-`; `cu13` variants exist for CUDA-13 hosts).
The nightly already installs SGLang as an **editable** install rooted at `/sgl-workspace/sglang`, so the command below simply swaps that directory for the branch checkout, with no reinstall needed. It then pins the matching transformers, fetches the checkpoint, and starts the server, which will be **listening at `http://0.0.0.0:30000`** (change `--port` to use a different port):

```bash
docker run --gpus all --shm-size=128g --network=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN=$HF_TOKEN \
  lmsysorg/sglang:nightly-dev-20260604-14ed9b44 \
  bash -lc '
    cd / && rm -rf /sgl-workspace/sglang &&
    git clone https://github.com/bzhng-development/sglang.git /sgl-workspace/sglang &&
    cd /sgl-workspace/sglang && git checkout gemma4-modelopt-ptq &&
    pip install "git+https://github.com/huggingface/transformers.git@1423d22f7a3b62e8c70ad67b58ec25cd9b675897" &&
    python -m sglang.launch_server \
      --model-path AxionML/Gemma-4-12B-FP8 \
      --quantization modelopt_fp8 \
      --kv-cache-dtype fp8_e4m3 \
      --reasoning-parser gemma4 --tool-call-parser gemma4 \
      --mem-fraction-static 0.85 \
      --host 0.0.0.0 --port 30000
  '
```

- `--network=host` exposes the server directly on the host's port 30000; alternatively drop it and use `-p 30000:30000`.
- For **MTP / NEXTN**, append the speculative flags from the **Speculative decoding** section above to the `launch_server` line (`HF_TOKEN` is then required because the draft `google/gemma-4-12B-it-assistant` is gated).
- The leading `cd /` matters: the image's default workdir *is* `/sgl-workspace/sglang`, so `rm -rf`-ing it from inside that directory makes `git` fail with *"Unable to read current working directory."*
- Any newer `lmsysorg/sglang:nightly-dev-*` tag also works, since each ships the same editable `/sgl-workspace/sglang` layout this relies on.
- **libnvidia-ml.so:** you need to mount the host NVML library only if `nvidia-smi` inside the container reports a driver/library version mismatch.
  If so, add a mount matching your host driver (e.g. `580.82.07`):

  ```
  -v /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.580.82.07:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1:ro
  ```

### ModelOpt install (editable, from source)

```bash
git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git
cd TensorRT-Model-Optimizer && pip install -e ".[hf]"  # commit f9423c0d3
```

## Limitations

The base model was trained on data that may contain toxic language and societal biases. The quantized model inherits these limitations and may generate inaccurate, biased, or offensive content. Quantization can introduce additional deviations from the base model's behavior. Please refer to the [original model card](https://huggingface.co/google/gemma-4-12B-it) for full details.

---

## Base model

[google/gemma-4-12B-it](https://huggingface.co/google/gemma-4-12B-it) is Google DeepMind's dense 11.95B-parameter Gemma 4 "Unified" (encoder-free) multimodal instruction-tuned model: text + image (+ audio) input, 256K context, hybrid sliding-window/global attention, configurable thinking mode, and native function calling. See the upstream card for full architecture, training data, evaluation, and responsible-AI details. This repository changes only the numeric precision of the weights (plus the FP8 KV-cache); all capabilities, the chat template, and the tokenizer are inherited unchanged.