MedGemma-27B-Text-IT-FP8-Dynamic

Overview

MedGemma-27B-Text-IT-FP8-Dynamic is an FP8 Dynamic-quantized derivative of Google’s MedGemma-27B-Text-IT model, optimized for high-throughput inference while preserving the base model’s strong performance on text-only medical and biomedical instruction-following tasks.

This version is intended for deployment with vLLM on recent NVIDIA GPUs (FP8 is natively accelerated on Hopper- and Ada Lovelace-class hardware, compute capability 8.9+) and follows a conservative FP8 Dynamic quantization strategy designed for maximum stability.


Base Model

  • Base model: google/medgemma-27b-text-it
  • Architecture: Decoder-only Transformer (instruction-tuned)
  • Domain: Medical / Biomedical NLP
  • Modality: Text-only

Quantization Details

  • Method: FP8 Dynamic
  • Tooling: llmcompressor
  • Quantized layers: Linear layers
  • Excluded components:
    • lm_head (kept unquantized; see the recipe sketch below)
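
For reference, a quantization of this shape is typically produced with llmcompressor roughly as follows. This is a minimal sketch, assuming a recent llmcompressor release; exact import paths and save options may differ between versions:

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "google/medgemma-27b-text-it"
SAVE_DIR = "medgemma-27b-text-it-FP8-Dynamic"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8_DYNAMIC: weights are quantized statically, activations are
# quantized dynamically per token at runtime, so no calibration
# dataset is needed.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],  # keep the output projection at full precision
)

oneshot(model=model, recipe=recipe)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)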

Rationale

  • FP8 Dynamic roughly halves weight memory relative to BF16 and improves inference throughput; activations are quantized on the fly, so no calibration data is required.
  • Keeping lm_head unquantized preserves the precision of the output logits.
  • The resulting checkpoint is fully compatible with vLLM.

Weights are already quantized — do not apply runtime quantization.
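
Because the checkpoint ships with its quantization config, vLLM picks up the FP8 scheme automatically. A minimal offline-inference sketch (the prompt is illustrative):

from vllm import LLM, SamplingParams

# vLLM reads the quantization config from the checkpoint; no
# quantization argument is needed.
llm = LLM(model="ig1/medgemma-27b-text-it-FP8-Dynamic")

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(
    ["List common drug interactions to screen for with warfarin."],
    params,
)
print(outputs[0].outputs[0].text)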


Intended Use

  • Medical and biomedical instruction-following
  • Clinical text summarization
  • Medical RAG pipelines
  • Decision-support and research assistance

Deployment (vLLM)

Recommended

vllm serve ig1/medgemma-27b-text-it-FP8-Dynamic \
  --served-model-name medgemma-27b-text-it-fp8 \
  --dtype auto
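
Once the server is up, it exposes an OpenAI-compatible API. A minimal client sketch, assuming vLLM's default port 8000 and the served model name from the command above:

from openai import OpenAI

# vLLM does not require an API key by default; "EMPTY" is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="medgemma-27b-text-it-fp8",  # matches --served-model-name
    messages=[
        {
            "role": "user",
            "content": "Summarize the first-line treatment options for type 2 diabetes.",
        }
    ],
    temperature=0.2,
)
print(response.choices[0].message.content)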