# Sparse FP4 Collection
This model is part of Cortecs' experimental model collection based on Qwen3.
This collection features 2:4 structured sparsity and NVFP4 / NVFP4-A16 quantization, optionally followed by light fine-tuning. The goal is to explore the trade-offs between compression, accuracy and throughput on Blackwell-class GPUs.
## Model Description

The models are derived from the Qwen3 family and compressed using:
- 2:4 structured sparsity (two of every four consecutive weights are pruned, i.e. 50 percent of the weights are zeroed)
- NVFP4 or NVFP4-A16 quantization
- An optional short fine-tuning pass to recover accuracy

These models target very high inference throughput on modern hardware while retaining useful accuracy for English and multilingual tasks.
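The exact compression recipe is not published here. As a rough illustration of the 2:4 pattern, the PyTorch sketch below keeps the two largest-magnitude weights in every group of four and zeroes the rest; the function name and the simple magnitude criterion are illustrative, not the actual pipeline.

```python
import torch

def prune_2_to_4(weight: torch.Tensor) -> torch.Tensor:
    """Illustrative magnitude-based 2:4 pruning: zero the two
    smallest-magnitude entries in every group of four along the last dim."""
    out_features, in_features = weight.shape
    assert in_features % 4 == 0, "2:4 sparsity needs the inner dim to be a multiple of 4"
    groups = weight.reshape(out_features, in_features // 4, 4)
    # Keep the two largest-magnitude entries per group of four.
    keep = groups.abs().topk(k=2, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, keep, 1.0)
    return (groups * mask).reshape(out_features, in_features)

w = torch.randn(8, 16)
w_sparse = prune_2_to_4(w)
print((w_sparse == 0).float().mean())  # 0.5: exactly two zeros per group of four
```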
## Evaluation
All results were produced with a unified evaluation pipeline using standard academic benchmarks.
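The card does not name the specific pipeline. Purely for illustration, a comparable run of the listed English tasks with EleutherAI's lm-evaluation-harness could look like the sketch below; the model id, task names, and batch size are placeholders, not the configuration behind the reported numbers.

```python
import lm_eval

# Illustrative only: model id and task selection are placeholders,
# not the configuration used for the table below.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen3-8B,dtype=bfloat16",
    tasks=["arc_challenge", "hellaswag", "mmlu", "truthfulqa_mc2", "crows_pairs"],
    batch_size=8,
)
print(results["results"])
```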
### Benchmark Results
| Model | ARC | HellaSwag | MMLU | ARC (de) | HellaSwag (de) | MMLU (de) | TruthfulQA | CrowS-Pairs | English Avg | German Avg | Safety Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3 8B | 66.7 | 67.2 | 78.22 | 54.8 | 54.9 | 67.8 | 54.42 | 37.69 | 70.71 | 59.17 | 46.06 |
| Qwen3 4B | 63.3 | 62.5 | 73.07 | 47.5 | 49.9 | 65.1 | 54.76 | 41.03 | 66.29 | 54.17 | 47.90 |
| Qwen3 8B NVFP4A16 | 66.4 | 66.5 | 75.54 | 54.2 | 54.4 | 67.7 | 53.72 | 38.04 | 69.48 | 58.77 | 45.88 |
| Qwen3 8B NVFP4 | 66.3 | 66.6 | 75.54 | 54.4 | 54.3 | 68.1 | 53.76 | 37.92 | 69.48 | 58.93 | 45.84 |
| Qwen3 8B Sparse NVFP4A16 | 50.5 | 57.4 | 53.35 | 30.7 | 36.0 | 34.4 | 46.95 | 39.89 | 53.75 | 33.70 | 43.42 |
| Qwen3 8B Sparse Finetune 0.01 | 53.8 | 62.8 | 60.17 | 35.8 | 46.6 | 46.4 | 50.66 | 39.18 | 58.92 | 42.93 | 44.92 |
| Qwen3 8B Sparse Finetune 0.1 | 56.4 | 62.2 | 60.89 | 38.9 | 46.2 | 44.0 | 52.13 | 38.04 | 59.83 | 43.03 | 45.09 |
## Performance

Throughput was measured on a single NVIDIA B200 GPU (an illustrative benchmarking sketch follows the table).
| Model | Total tokens/s |
|---|---|
| Qwen3 8B | 30379 |
| Qwen3 4B | 34483 |
| Qwen3 8B NVFP4A16 | 15978 |
| Qwen3 8B Sparse NVFP4A16 | 15860 |
| Qwen3 8B NVFP4 | 35296 |
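The serving stack and request mix behind these numbers are not specified. As a hedged sketch, the snippet below loads a checkpoint with vLLM and estimates generation throughput. The model id is a hypothetical placeholder, sparse FP4 kernel support depends on the vLLM build and GPU, and the metric here counts generated tokens only, which may not match the table's definition of total tokens/s.

```python
import time
from vllm import LLM, SamplingParams

# Hypothetical repository name, used only as a placeholder.
llm = LLM(model="cortecs/Qwen3-8B-NVFP4", max_model_len=4096)

prompts = ["Explain 2:4 structured sparsity in one paragraph."] * 64
params = SamplingParams(temperature=0.0, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.0f} generated tokens/s")
```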
## Notes

- 2:4 structured sparsity always zeroes exactly two of every four consecutive weights, i.e. 50 percent of the weights.
- FP4 execution on Blackwell requires specialized kernels; throughput varies with backend support (a toy illustration of FP4 rounding follows these notes).
- Sparse FP4 models trade accuracy for improved efficiency; light fine-tuning recovers part of the lost accuracy (see the benchmark table above).
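To make the rounding behaviour of FP4 concrete, here is a toy fake-quantization sketch that snaps each block of weights onto the FP4 (E2M1) value grid using a simple per-block absmax scale. This is an approximation for intuition only; the 16-element block size is an assumption, and real NVFP4 kernels store compact low-precision block scales rather than working like this.

```python
import torch

# Representable FP4 (E2M1) magnitudes.
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_fp4(x: torch.Tensor, block_size: int = 16) -> torch.Tensor:
    """Toy fake-quantization: per-block absmax scaling followed by
    rounding to the nearest E2M1 value (block size of 16 is an assumption)."""
    assert x.numel() % block_size == 0
    blocks = x.reshape(-1, block_size)
    scale = blocks.abs().amax(dim=-1, keepdim=True) / E2M1_GRID.max()
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)
    scaled = blocks / scale
    # Snap each scaled value to the nearest representable magnitude, keep the sign.
    idx = (scaled.abs().unsqueeze(-1) - E2M1_GRID).abs().argmin(dim=-1)
    quant = E2M1_GRID[idx] * scaled.sign()
    return (quant * scale).reshape(x.shape)

w = torch.randn(4, 64)
print((fake_quant_fp4(w) - w).abs().mean())  # mean quantization error
```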
## Intended Use

These models are experimental and intended solely for evaluating sparsity and quantization strategies. They should not be used in production systems, safety-critical applications, or deployments that handle real user data.
## Limitations
Sparse FP4 models may exhibit reduced robustness and generalization.