DeepSeek-R1-0528-Qwen3-8B-KV

Enterprise-grade OCP FP8-quantized DeepSeek-R1-0528-Qwen3-8B for AMD ROCm, with an end-to-end FP8 KV cache built using AMD Quark


Introduction

DeepSeek-R1-0528-Qwen3-8B-KV is a full-pipeline, OCP-compliant FP8_e4m3 quantization of deepseek-ai/DeepSeek-R1-0528-Qwen3-8B, built with AMD Quark and optimized for AMD Instinct GPUs. It delivers roughly 1.8× memory savings and a corresponding throughput boost over FP16, at the cost of only a nominal perplexity increase on WikiText2 (10.88 → 11.00).


Quantization Strategy

  • Quantizer: AMD Quark v0.9+
  • Numeric Format: OCP FP8_e4m3 symmetric, per-tensor
  • Scope: All Linear layers (excluding lm_head), activations, and KV cache
  • Group Size: 128 (block-aligned)
  • Calibration: 128 Pile samples (default)
  • Metadata: scales embedded in JSON + SafeTensors
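
For reference, a recipe along these lines can typically be reproduced with Quark's LLM post-training-quantization example script. The invocation below is a sketch modeled on AMD's published FP8-KV recipes; the script location and exact flags are assumptions and may differ across Quark releases:

# Sketch only: assumes Quark's examples/torch/language_modeling/llm_ptq layout
python3 quantize_quark.py \
  --model_dir deepseek-ai/DeepSeek-R1-0528-Qwen3-8B \
  --output_dir DeepSeek-R1-0528-Qwen3-8B-KV \
  --quant_scheme w_fp8_a_fp8 \
  --kv_cache_dtype fp8 \
  --num_calib_data 128 \
  --model_export hf_format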

Performance Snapshot

Metric                 FP16 Baseline   FP8_e4m3 Quantized
WikiText2 Perplexity   10.88           11.00
Memory Footprint       1.0×            0.56×

Quick Start

Serve with vLLM

Allow overriding the model's default maximum context length:

export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

Serve

HIP_VISIBLE_DEVICES=0 vllm serve EliovpAI/DeepSeek-R1-0528-Qwen3-8B-KV \
  --kv-cache-dtype fp8 \
  --num-scheduler-steps 10  # ... other arguments
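
Once the server is up, a quick sanity check against vLLM's OpenAI-compatible completions endpoint (assuming the default port 8000):

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "EliovpAI/DeepSeek-R1-0528-Qwen3-8B-KV",
        "prompt": "Explain FP8 KV-cache quantization in one sentence.",
        "max_tokens": 128
      }'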

Benchmark

python3 /vllm/benchmarks/benchmark_serving.py \
  --backend vllm \
  --model EliovpAI/DeepSeek-R1-0528-Qwen3-8B-KV \
  --dataset-name sharegpt \
  --dataset-path /vllm/ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 32 \
  --random-range-ratio 1.0 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --sharegpt-output-len 256
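
If the ShareGPT split is not already at the path above, it can be fetched first; the URL below is the Hugging Face mirror commonly used in vLLM's benchmarking docs:

wget -O /vllm/ShareGPT_V3_unfiltered_cleaned_split.json \
  https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json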

Evaluation

We benchmarked on WikiText2 using vLLM’s /v1/completions PPL metric:

  • FP16 (DeepSeek-R1-0528-Qwen3-8B) → 10.88 PPL
  • FP8_e4m3 (this model) → 11.00 PPL

The ~0.12-point PPL delta buys substantial memory and throughput gains, with virtually imperceptible quality loss on most benchmarks.
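
A comparable measurement can be run against the live server with lm-evaluation-harness's OpenAI-compatible local-completions backend. This is a sketch, not necessarily the exact harness used for the numbers above; note that the harness's wikitext task reports word-level perplexity, so absolute values may differ from the token-level PPL quoted here:

# Sketch: assumes a vLLM server already running on localhost:8000
lm_eval --model local-completions \
  --model_args model=EliovpAI/DeepSeek-R1-0528-Qwen3-8B-KV,base_url=http://localhost:8000/v1/completions \
  --tasks wikitext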

License

This model inherits the license of deepseek-ai/DeepSeek-R1-0528-Qwen3-8B.
