# DeepSeek-R1-0528-Qwen3-8B-KV

Enterprise-grade OCP FP8-quantized DeepSeek-R1-0528-Qwen3-8B for AMD ROCm, with an end-to-end FP8 KV cache produced with AMD Quark.
## Introduction
DeepSeek-R1-0528-Qwen3-8B-KV is a full-pipeline, OCP-compliant FP8_e4m3 quantization of deepseek-ai/DeepSeek-R1-0528-Qwen3-8B, built with AMD Quark and optimized for AMD Instinct GPUs. Compared with FP16, it delivers roughly 1.8× lower memory use and a corresponding throughput boost, at the cost of only a marginal perplexity increase (10.88 → 11.00 on WikiText2).
## Quantization Strategy
- Quantizer: AMD Quark v0.9+
- Numeric Format: OCP FP8_e4m3 symmetric, per-tensor
- Scope: All `Linear` layers (excluding `lm_head`), activations, and KV cache
- Group Size: 128 (block-aligned)
- Calibration: 128 Pile samples (default)
- Metadata: quantization scales embedded in the config JSON and the SafeTensors checkpoint
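For intuition, the sketch below shows the arithmetic that OCP FP8_e4m3 symmetric, per-tensor quantization boils down to. It is a plain-PyTorch illustration, not the Quark API used to produce this checkpoint, and the weight tensor is made up.

```python
# Plain-PyTorch illustration of OCP FP8_e4m3 symmetric, per-tensor quantization.
# This is NOT the AMD Quark API, only the underlying arithmetic.
import torch

def quantize_fp8_e4m3_per_tensor(w: torch.Tensor):
    """Return (w_fp8, scale) such that w ≈ w_fp8.float() * scale."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max        # 448.0 for e4m3
    scale = w.abs().max().clamp(min=1e-12) / fp8_max      # symmetric, one scale per tensor
    w_fp8 = (w / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return w_fp8, scale

w = torch.randn(4096, 4096)                               # stand-in for a Linear weight
w_fp8, scale = quantize_fp8_e4m3_per_tensor(w)
w_dequant = w_fp8.float() * scale
print("max abs quantization error:", (w - w_dequant).abs().max().item())
```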
## Performance Snapshot
| Metric | FP16 Baseline | FP8_e4m3 Quantized |
|---|---|---|
| WikiText2 Perplexity (lower is better) | 10.88 | 11.00 |
| Memory Footprint (relative) | 1.00× | 0.56× |
## Quick Start
### Serve with vLLM
If you need a context length above the model's default, allow vLLM to override it:

```bash
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
```
Then start the server:

```bash
HIP_VISIBLE_DEVICES=0 \
vllm serve EliovpAI/DeepSeek-R1-0528-Qwen3-8B-KV \
  --kv-cache-dtype fp8 \
  --num-scheduler-steps 10
  # ... other arguments
```
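Once the server is up, it exposes vLLM's OpenAI-compatible API. A minimal smoke test, assuming the default port (8000) and an arbitrary prompt:

```python
# Minimal client-side check against the OpenAI-compatible endpoint served by vLLM.
# Assumes the `vllm serve` command above is running locally on the default port 8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="EliovpAI/DeepSeek-R1-0528-Qwen3-8B-KV",
    prompt="In one sentence, why does an FP8 KV cache reduce memory use?",
    max_tokens=128,
    temperature=0.6,
)
print(resp.choices[0].text)
```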
### Benchmark
```bash
python3 /vllm/benchmarks/benchmark_serving.py \
  --backend vllm \
  --model EliovpAI/DeepSeek-R1-0528-Qwen3-8B-KV \
  --dataset-name sharegpt \
  --dataset-path /vllm/ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 32 \
  --random-range-ratio 1.0 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --sharegpt-output-len 256
```
## Evaluation
We measured WikiText2 perplexity through vLLM's `/v1/completions` endpoint:
- FP16 (DeepSeek-R1-0528-Qwen3-8B) → 10.88 PPL
- FP8_e4m3 (this model) → 11.00 PPL
The ~0.12-point perplexity increase buys a large reduction in memory and a boost in serving speed, with virtually imperceptible quality loss in most benchmarks.
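For reference, below is a rough sketch of how perplexity can be scored through the `/v1/completions` endpoint. It assumes the server from the Quick Start is running locally and that the endpoint returns prompt log-probabilities when `echo=True` is set (vLLM supports this); the text chunking and averaging are simplified relative to a full WikiText2 run.

```python
# Rough sketch of endpoint-based perplexity scoring. Assumes the vLLM server
# from the Quick Start is listening on localhost:8000 and that /v1/completions
# returns prompt log-probabilities when echo=True. Illustrative only; not the
# exact recipe behind the numbers above.
import math
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def perplexity(text: str, model: str) -> float:
    resp = client.completions.create(
        model=model,
        prompt=text,
        max_tokens=1,   # we only need the prompt's log-probs
        echo=True,      # ask for log-probs of the prompt tokens themselves
        logprobs=1,
    )
    token_logprobs = resp.choices[0].logprobs.token_logprobs
    # Drop the single generated token at the end; the first prompt token has
    # no conditioning context, so its entry is None and is skipped as well.
    scored = [lp for lp in token_logprobs[:-1] if lp is not None]
    return math.exp(-sum(scored) / len(scored))

print(perplexity("The quick brown fox jumps over the lazy dog.",
                 "EliovpAI/DeepSeek-R1-0528-Qwen3-8B-KV"))
```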
## License
This model is distributed under the same license as DeepSeek-R1-0528-Qwen3-8B.