L3.3-Electra-R1-70b - FP8 Dynamic Quantization
This is an FP8 quantized version of Steelskull/L3.3-Electra-R1-70b using llmcompressor with the FP8_DYNAMIC scheme.
Model Details
- Base Model: Steelskull/L3.3-Electra-R1-70b
- Quantization: FP8_DYNAMIC (W8A8)
- Format: compressed-tensors (SafeTensors)
- Memory: ~50% of original BF16 size
- Quality: typically <1-2% degradation on standard benchmarks
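To verify the scheme listed above after downloading, you can inspect the checkpoint's config without loading any weights. A minimal check, assuming the quantization settings are stored under `quantization_config` in `config.json` (the usual compressed-tensors layout):

```python
from transformers import AutoConfig

# Load only the config (no weights) and print the quantization metadata
config = AutoConfig.from_pretrained("sh0ck0r/L3.3-Electra-R1-70b-FP8-Dynamic")
print(config.quantization_config)  # expect compressed-tensors format with an FP8 dynamic scheme
```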
Quick Start
vLLM (Recommended)
```bash
pip install vllm

# Serve the model (OpenAI-compatible server)
vllm serve sh0ck0r/L3.3-Electra-R1-70b-FP8-Dynamic \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95
```

```python
# Python API
from vllm import LLM

llm = LLM(model="sh0ck0r/L3.3-Electra-R1-70b-FP8-Dynamic")
outputs = llm.generate("Hello, how are you?")
print(outputs[0].outputs[0].text)
```
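Once `vllm serve` is running, it exposes an OpenAI-compatible API (port 8000 by default). A minimal client sketch using the `openai` package; the model name must match the served repo id:

```python
from openai import OpenAI

# Point the client at the local vLLM server; the API key can be any placeholder
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="sh0ck0r/L3.3-Electra-R1-70b-FP8-Dynamic",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```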
Transformers
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "sh0ck0r/L3.3-Electra-R1-70b-FP8-Dynamic",
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained("sh0ck0r/L3.3-Electra-R1-70b-FP8-Dynamic")

messages = [{"role": "user", "content": "Hello!"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
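Note: loading compressed-tensors checkpoints through Transformers typically requires the `compressed-tensors` package (`pip install compressed-tensors`), and inference will be slower than with vLLM, which ships dedicated FP8 kernels.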
Quantization Details
This model was quantized using:
- Tool: llmcompressor
- Method: FP8_DYNAMIC (Round-to-Nearest)
- Targets: All Linear layers except lm_head
- Scheme: W8A8 (8-bit weights and activations)
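For reference, a quantization run of this kind can be reproduced with a recipe along the following lines. This is a minimal sketch based on llmcompressor's published FP8_DYNAMIC examples, not the exact script used for this checkpoint; import paths and argument names may vary between llmcompressor versions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Steelskull/L3.3-Electra-R1-70b"
OUTPUT_DIR = "L3.3-Electra-R1-70b-FP8-Dynamic"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8_DYNAMIC: static per-channel weight scales plus dynamic per-token
# activation scales, so no calibration dataset is required.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)

# Save in compressed-tensors (SafeTensors) format with the quantization config
model.save_pretrained(OUTPUT_DIR, save_compressed=True)
tokenizer.save_pretrained(OUTPUT_DIR)
```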
Performance
Memory Usage
- Original BF16: ~140 GB of weights (70B parameters × 2 bytes)
- FP8 Quantized: ~70 GB of weights (~50% of original)
- Savings: ~50% VRAM reduction on weights (see the quick check below)
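As a rough sanity check on the figures above (weights only, ignoring KV cache and runtime overhead):

```python
# Back-of-the-envelope weight-memory estimate for a 70B-parameter model
params = 70e9
bf16_gb = params * 2 / 1e9  # 2 bytes per parameter -> ~140 GB
fp8_gb = params * 1 / 1e9   # 1 byte per parameter  -> ~70 GB
print(f"BF16: ~{bf16_gb:.0f} GB, FP8: ~{fp8_gb:.0f} GB, savings: {1 - fp8_gb / bf16_gb:.0%}")
```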
Inference Speed
- Expect roughly 1.3-1.8× faster inference vs BF16 (largest gains on GPUs with native FP8 support, e.g. Hopper/Ada)
- Up to ~2× higher throughput, since the freed VRAM leaves room for a larger KV cache
Use Cases
Perfect for:
- ✅ Production inference on limited VRAM
- ✅ Running larger models on single GPU
- ✅ Cost-effective API serving
- ✅ High-throughput applications
- ✅ Extended context lengths (more KV cache)
Hardware Requirements
Minimum VRAM (approximate):
- Weights alone: ~70 GB (70B parameters at 8 bits)
- Practical setups: one 80 GB GPU (A100 80GB, H100, H200), or two 48 GB GPUs (e.g. RTX A6000) with tensor parallelism
Recommended:
- H100/H200 for best performance
- vLLM for optimized serving
- Enable FP8 KV cache for extended context
Important Notes
⚠️ Quantization Trade-offs:
- Slight quality degradation (typically <1-2%)
- Not suitable for fine-tuning (inference only)
- Best with vLLM (has FP8 kernel optimizations)
✅ Best Practices (combined in the example below):
- Use `--kv-cache-dtype fp8` for longer contexts
- Set `--gpu-memory-utilization` to 0.90-0.95
- Add `--enforce-eager` if you encounter compilation issues
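Putting those flags together, a tuned launch might look like this sketch (drop `--tensor-parallel-size` on a single GPU, and only add `--enforce-eager` if you actually hit compilation problems):

```bash
vllm serve sh0ck0r/L3.3-Electra-R1-70b-FP8-Dynamic \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 2 \
  --enforce-eager
```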
Citation
If you use this model, please cite:
```bibtex
@misc{l33_electra_r1_70b_fp8,
  author    = {sh0ck0r},
  title     = {L3.3-Electra-R1-70b FP8 Dynamic Quantization},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/sh0ck0r/L3.3-Electra-R1-70b-FP8-Dynamic}
}
```
License
Inherits license from base model: Steelskull/L3.3-Electra-R1-70b
Acknowledgments
- Base model by Steelskull
- Quantization via llmcompressor
- Serving optimized for vLLM
Want more FP8 models? Check out my other quantizations!
Model tree for sh0ck0r/L3.3-Electra-R1-70b-FP8-Dynamic
- Base model: meta-llama/Llama-3.1-70B
- Finetuned: meta-llama/Llama-3.3-70B-Instruct
- Finetuned: EVA-UNIT-01/EVA-LLaMA-3.33-70B-v0.0
- Finetuned: Steelskull/L3.3-Electra-R1-70b (quantized in this repo)