Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization
Abstract
A new quantization method, Micro-Rotated-GPTQ, addresses the challenges of 4-bit floating-point formats MXFP4 and NVFP4, achieving high performance and accuracy in large language model inference.
Recent hardware-accelerated microscaling 4-bit floating-point formats such as MXFP4 and NVFP4, supported on NVIDIA and AMD GPUs, promise to revolutionize large language model (LLM) inference. Yet their practical benefits remain unproven. We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization, revealing gaps between their promise and real-world performance. Our analysis shows that state-of-the-art methods struggle with FP4 due to two key issues: (1) NVFP4's small group size provably neutralizes traditional outlier-mitigation techniques; (2) MXFP4's power-of-two scale quantization severely degrades accuracy because of the high error it induces. To bridge this gap, we introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm that tailors the quantization process to FP4's unique properties via block-wise Hadamard transforms and format-specific optimizations. We support our proposal with a set of high-performance GPU kernels that enable the MR-GPTQ format with negligible overhead by fusing the rotations into the weights and computing the activation rotations quickly online. This yields speedups over FP16 of up to 3.6x layer-wise and 2.2x end-to-end on NVIDIA B200, and of 6x layer-wise and 4x end-to-end on RTX 5090. Our extensive empirical evaluation demonstrates that MR-GPTQ matches or outperforms state-of-the-art accuracy, significantly boosting MXFP4 to the point where its accuracy nears that of NVFP4. We conclude that, while FP4 is not an automatic upgrade over INT4, format-specialized methods like MR-GPTQ can unlock a new frontier of accuracy-performance trade-offs.
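To make the scale-quantization issue concrete, the minimal NumPy sketch below (our own illustration, not code from the paper) quantizes one weight group onto the FP4 E2M1 grid with a single shared scale. Setting `power_of_two_scale=True` mimics MXFP4's E8M0 scale, which must be a power of two; `False` mimics NVFP4's finer-grained scale (stored as FP8 E4M3 in hardware, kept in full precision here for simplicity). The function name and the round-up rule for the power-of-two scale are illustrative choices, not the formats' normative rounding.

```python
import numpy as np

# Magnitudes representable by the FP4 E2M1 element format (sign is a separate bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_group(x, power_of_two_scale):
    """Round-to-nearest FP4 quantization of one group with a single shared scale."""
    amax = np.max(np.abs(x))
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0   # map the group max onto 6.0
    if power_of_two_scale:                               # MXFP4-style E8M0 scale
        scale = 2.0 ** np.ceil(np.log2(scale))           # round up so nothing overflows
    # Snap each scaled magnitude to the nearest FP4 grid point, then restore sign and scale.
    idx = np.abs(np.abs(x)[:, None] / scale - E2M1_GRID[None, :]).argmin(axis=1)
    return np.sign(x) * E2M1_GRID[idx] * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(32)                              # one MXFP4-sized group (32 values)
err_po2  = np.mean((w - quantize_group(w, True)) ** 2)   # power-of-two scale (MXFP4-like)
err_free = np.mean((w - quantize_group(w, False)) ** 2)  # unconstrained scale (NVFP4-like)
print(f"group MSE, power-of-two scale: {err_po2:.4f}  free scale: {err_free:.4f}")
```

On Gaussian-like weights, the power-of-two constraint alone typically inflates the per-group error, which is one source of the MXFP4 accuracy gap discussed above; NVFP4 additionally halves the group size from 32 to 16 elements per scale.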
Community
In this work we present a systematic study of FP4 quantization for large language models. We examine how scale quantization and rotations affect the accuracy of the quantized model, and we introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ algorithm that tailors the quantization process to FP4 and outperforms previous state-of-the-art methods.
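As a rough illustration of the rotation component (again a minimal NumPy sketch under our own assumptions, using SciPy's `hadamard` helper for brevity, and not the paper's GPU kernels), the snippet below applies a small orthonormal Hadamard transform independently to each quantization group. Because the transform is orthogonal, rotating weights and activations with the same blocks leaves the matrix product unchanged, which is what allows the rotation to be folded into the stored weights offline and applied to activations cheaply online; the group size of 16 matches NVFP4, while MXFP4 would use 32.

```python
import numpy as np
from scipy.linalg import hadamard

GROUP = 16                                    # NVFP4 group size; MXFP4 groups 32 values
H = hadamard(GROUP) / np.sqrt(GROUP)          # orthonormal Hadamard block: H @ H.T == I

def block_rotate(v, H):
    """Apply the same small rotation independently to every contiguous group of v."""
    return (v.reshape(-1, H.shape[0]) @ H.T).reshape(v.shape)

rng = np.random.default_rng(0)
x = rng.standard_normal(64)                   # toy activations
w = rng.standard_normal(64)                   # one weight row
# Orthogonality: the rotated dot product equals the original one.
assert np.allclose(block_rotate(x, H) @ block_rotate(w, H), x @ w)

# The rotation also spreads a single large outlier across its whole group,
# which makes the coarse FP4 grid easier to hit.
spike = np.zeros(GROUP); spike[0] = 8.0
print(np.max(np.abs(block_rotate(spike, H))))  # 8 / sqrt(16) = 2.0
```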
Links
The following papers were recommended by the Semantic Scholar API:
- Pretraining Large Language Models with NVFP4 (2025)
- SBVR: Summation of BitVector Representation for Efficient LLM Quantization (2025)
- SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights (2025)
- Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment (2025)
- XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization (2025)
- Scaling LLM Test-Time Compute with Mobile NPU on Smartphones (2025)
- LLM Compression: How Far Can We Go in Balancing Size and Performance? (2025)