Sparse FP4 Collection

This model is part of Cortecs' experimental model collection based on Qwen3.
This collection features 2:4 structured sparsity and NVFP4 / NVFP4-A16 quantization, optionally followed by light fine-tuning. The goal is to explore the trade-offs between compression, accuracy and throughput on Blackwell-class GPUs.

Model Description

The models are derived from the Qwen3 family and compressed using:

  • 2:4 structured sparsity (two of every four consecutive weights zeroed, i.e. 50 percent sparsity; see the sketch after this list)
  • NVFP4 or NVFP4-A16 quantization
  • Optional short fine-tuning to recover accuracy
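
To make the 2:4 pattern concrete, here is a minimal PyTorch sketch of magnitude-based 2:4 pruning. It illustrates the constraint only; the actual pruning recipe used for these checkpoints (for example a calibration-based method such as SparseGPT) is not specified on this card.

```python
import torch

def prune_2_4(weight: torch.Tensor) -> torch.Tensor:
    """Apply magnitude-based 2:4 structured sparsity along the last dim.

    In every contiguous group of 4 weights, the 2 smallest-magnitude
    entries are set to zero, giving exactly 50% sparsity.
    """
    out_features, in_features = weight.shape
    assert in_features % 4 == 0, "input dim must be divisible by 4"
    groups = weight.reshape(out_features, in_features // 4, 4)
    # Indices of the 2 largest-magnitude weights in each group of 4.
    keep = groups.abs().topk(k=2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, keep, True)
    return (groups * mask).reshape(out_features, in_features)

w = torch.randn(8, 16)
w_sparse = prune_2_4(w)
assert (w_sparse == 0).float().mean() == 0.5  # exactly 50% zeros
```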

These models target very high throughput on Blackwell-class hardware while retaining useful accuracy for English and multilingual tasks.
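
For intuition about the number format: NVFP4 stores values on a 4-bit E2M1 grid, with blocks of 16 elements sharing a scale factor (held in FP8 E4M3 in hardware); NVFP4-A16 applies this to weights only, keeping activations in 16-bit. The sketch below fake-quantizes a tensor through such a representation. It is a numerical illustration under those assumptions, not the quantizer used to produce these checkpoints.

```python
import torch

# Representable E2M1 (FP4) magnitudes; NVFP4 adds a sign bit.
E2M1 = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_nvfp4(x: torch.Tensor, block: int = 16) -> torch.Tensor:
    """Round-trip a tensor through an NVFP4-style representation.

    Each block of 16 values shares a scale (FP8 E4M3 in the real format;
    kept in full precision here for simplicity), and values are snapped
    to the nearest E2M1 grid point.
    """
    orig_shape = x.shape
    blocks = x.reshape(-1, block)
    # Per-block scale so the largest magnitude maps to E2M1's max (6.0).
    scale = blocks.abs().amax(dim=-1, keepdim=True) / 6.0
    scale = torch.clamp(scale, min=1e-12)
    scaled = blocks / scale
    # Snap each scaled value to the nearest representable FP4 magnitude.
    idx = (scaled.abs().unsqueeze(-1) - E2M1).abs().argmin(dim=-1)
    snapped = E2M1[idx] * scaled.sign()
    return (snapped * scale).reshape(orig_shape)

x = torch.randn(4, 32)
err = (x - fake_quant_nvfp4(x)).abs().mean()
print(f"mean abs quantization error: {err:.4f}")
```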

Evaluation

All results were produced with a unified evaluation pipeline using standard academic benchmarks.
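
The card does not name the evaluation pipeline. As a hedged reproduction sketch, EleutherAI's lm-evaluation-harness covers most of the benchmarks in the table below; exact task variants, few-shot settings, and the German task names used here are assumptions and may differ from the actual pipeline.

```python
import lm_eval

# Hypothetical reproduction of the English benchmarks; not the card's
# exact pipeline. Task names follow lm-evaluation-harness conventions.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=cortecs/Qwen3-8B-NVFP4A16,dtype=bfloat16",
    tasks=["arc_challenge", "hellaswag", "mmlu", "truthfulqa_mc2"],
    batch_size=8,
)
print(results["results"])
```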

Benchmark Results

| Model | ARC | Hellaswag | MMLU | ARC_de | Hellaswag_de | MMLU_de | TruthfulQA | CrowS | English Avg | German Avg | Safety Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3 8B | 66.7 | 67.2 | 78.22 | 54.8 | 54.9 | 67.8 | 54.42 | 37.69 | 70.71 | 59.17 | 46.06 |
| Qwen3 4B | 63.3 | 62.5 | 73.07 | 47.5 | 49.9 | 65.1 | 54.76 | 41.03 | 66.29 | 54.17 | 47.90 |
| Qwen3 8B NVFP4A16 | 66.4 | 66.5 | 75.54 | 54.2 | 54.4 | 67.7 | 53.72 | 38.04 | 69.48 | 58.77 | 45.88 |
| Qwen3 8B NVFP4 | 66.3 | 66.6 | 75.54 | 54.4 | 54.3 | 68.1 | 53.76 | 37.92 | 69.48 | 58.93 | 45.84 |
| Qwen3 8B Sparse NVFP4A16 | 50.5 | 57.4 | 53.35 | 30.7 | 36.0 | 34.4 | 46.95 | 39.89 | 53.75 | 33.70 | 43.42 |
| Qwen3 8B Sparse Finetune 0.01 | 53.8 | 62.8 | 60.17 | 35.8 | 46.6 | 46.4 | 50.66 | 39.18 | 58.92 | 42.93 | 44.92 |
| Qwen3 8B Sparse Finetune 0.1 | 56.4 | 62.2 | 60.89 | 38.9 | 46.2 | 44.0 | 52.13 | 38.04 | 59.83 | 43.03 | 45.09 |

Performance

Throughput measurements were conducted on a single B200 GPU.

| Model | Total tokens/s |
|---|---|
| Qwen3 8B | 30379 |
| Qwen3 4B | 34483 |
| Qwen3 8B NVFP4A16 | 15978 |
| Qwen3 8B Sparse NVFP4A16 | 15860 |
| Qwen3 8B NVFP4 | 35296 |
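
The card does not state which serving stack produced these numbers. As a rough sanity check, a minimal offline throughput probe with vLLM (an assumption; NVFP4 kernel support depends on the vLLM and backend versions) could look like this:

```python
import time
from vllm import LLM, SamplingParams

# Hypothetical throughput probe; not the benchmark behind the table above.
llm = LLM(model="cortecs/Qwen3-8B-NVFP4A16")
params = SamplingParams(max_tokens=256, temperature=0.0)
prompts = ["Summarize the history of GPUs."] * 64

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.0f} generated tokens/s")
```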

Notes

  • By construction, 2:4 structured sparsity zeroes exactly 50 percent of the weights.
  • FP4 execution on Blackwell requires specialized kernels; throughput varies depending on backend support.
  • Sparse FP4 models show reduced accuracy in exchange for higher compression; light fine-tuning is essential to recover accuracy (a mask-preserving fine-tuning sketch follows this list).
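
The fine-tuning recipe is not described on this card (the 0.01 / 0.1 suffixes are not explained here). One common approach, sketched below under that assumption, is to freeze the 2:4 mask and re-apply it after every optimizer step so recovery training cannot densify the model:

```python
import torch

def capture_masks(model: torch.nn.Module) -> dict:
    """Record the current zero pattern of every 2D weight matrix."""
    return {
        name: (param != 0)
        for name, param in model.named_parameters()
        if param.dim() == 2
    }

@torch.no_grad()
def reapply_masks(model: torch.nn.Module, masks: dict) -> None:
    """Zero out pruned weights so updates cannot break the 2:4 pattern."""
    for name, param in model.named_parameters():
        if name in masks:
            param.mul_(masks[name])

# Inside the training loop (sketch):
#   loss.backward()
#   optimizer.step()
#   reapply_masks(model, masks)  # keep the sparsity pattern intact
```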

Intended Use

These models are experimental, designed only to evaluate sparsity and quantization strategies. They should not be used for production systems, safety-critical applications, or deployment scenarios involving real user data.

Limitations

Sparse FP4 models may exhibit reduced robustness and generalization.
