# Modal Deployment for GPT-OSS vLLM
Deploy OpenAI's GPT-OSS models (20B or 120B) on [Modal.com](https://modal.com) with vLLM for efficient inference.
## 🚀 Quick Start
### 1. Install Modal CLI
```bash
# Install the Modal Python package
pip install modal
# Authenticate with Modal (opens browser)
modal setup
```
If `modal setup` doesn't work, try:
```bash
python -m modal setup
```
### 2. Create a Modal Account
1. Go to [modal.com](https://modal.com)
2. Create a free account
3. Run `modal setup` to authenticate
### 3. Deploy the GPT-OSS Model
```bash
# Navigate to the modal directory
cd modal
# Test the server (spins up a temporary instance)
modal run gpt_oss_inference.py
# Deploy to production (creates a persistent endpoint)
modal deploy gpt_oss_inference.py
```
## 📋 Configuration
### GPU Selection (Cost Optimization)
Edit `gpt_oss_inference.py` to choose your GPU tier:
```python
# Choose your GPU - uncomment the one you want:
GPU_CONFIG = "A10G" # ~$0.76/hr - RECOMMENDED for budget ✅
# GPU_CONFIG = "L4" # ~$0.59/hr - Cheapest option
# GPU_CONFIG = "A100" # ~$1.79/hr - More headroom
# GPU_CONFIG = "H100" # ~$3.95/hr - Maximum performance
```
### GPU Pricing Comparison
| GPU | VRAM | Price/hr | Best For |
| --------- | ---- | ---------- | -------------------------------- |
| L4        | 24GB | ~$0.59     | Cheapest option (VRAM may be tight) |
| **A10G** | 24GB | **~$0.76** | **Best value for GPT-OSS 20B** ✅ |
| A100 40GB | 40GB | ~$1.79 | More headroom |
| A100 80GB | 80GB | ~$2.78 | Both 20B and 120B |
| H100 | 80GB | ~$3.95 | Maximum performance |
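For orientation, here is a minimal sketch of how a GPU tier string is attached to a Modal function (the app name and `check_gpu` helper are illustrative; the real wiring lives in `gpt_oss_inference.py`):
```python
import modal

app = modal.App("gpu-config-demo")  # illustrative name; the real app is in gpt_oss_inference.py

GPU_CONFIG = "A10G"  # swap for "L4", "A100-40GB", "A100-80GB", or "H100"

@app.function(gpu=GPU_CONFIG)
def check_gpu():
    # Print the GPU that Modal attached to this container.
    import subprocess
    print(subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout)
```
Running `modal run <file>.py::check_gpu` prints the GPU actually attached, which is a quick way to verify your tier before deploying.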
### Model Selection
```python
# 20B model - faster, fits on A10G/L4
MODEL_NAME = "openai/gpt-oss-20b"
# 120B model - needs A100 80GB or H100
MODEL_NAME = "openai/gpt-oss-120b"
```
### Performance Tuning
```python
# FAST_BOOT = True - Faster startup, less memory (use for smaller GPUs)
# FAST_BOOT = False - Slower startup, faster inference
FAST_BOOT = True
# Data type - GPT-OSS MXFP4 quantization REQUIRES bfloat16 (float16 not supported)
# The Marlin kernel warning on A10G/L4 is expected and can be ignored
USE_FLOAT16 = False # Must be False for GPT-OSS (MXFP4 only supports bfloat16)
# Maximum model length (context window) - reduce to speed up startup
MAX_MODEL_LEN = 32768 # 32k tokens (can increase to 131072 if needed)
# How long an idle container stays up before scaling to zero
SCALEDOWN_WINDOW = 5 * MINUTES # Shorter saves idle cost; raise it to avoid cold starts
# Maximum concurrent requests (reduce for smaller GPUs)
MAX_INPUTS = 50
```
#### Startup Time Optimization
The following optimizations are enabled by default to reduce the ~1 minute startup time:
- **`--max-model-len 32768`**: Limits the context window to 32k tokens (faster startup; can be raised to 131072 if needed)
- **`--disable-custom-all-reduce`**: Skips custom all-reduce setup on single-GPU deployments (reduces startup overhead)
- **`--enable-prefix-caching`**: Enables prefix caching for faster subsequent requests
- **`--load-format auto`**: Auto-detects best loading format for faster model loading
- **Reduced scaledown window**: Idle containers scale down after 5 minutes instead of 10 (lower idle cost; raise `SCALEDOWN_WINDOW` if cold starts hurt)
Note: `--dtype bfloat16` is required for GPT-OSS (MXFP4 quantization only supports bf16)
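Put together, these settings imply a vLLM invocation along the following lines (a sketch; the exact argument assembly in `gpt_oss_inference.py` may differ):
```python
# Sketch of the vLLM server command implied by the settings above
# (flag names are real vLLM CLI options; the exact assembly in
# gpt_oss_inference.py may differ).
MODEL_NAME = "openai/gpt-oss-20b"
MAX_MODEL_LEN = 32768

cmd = [
    "vllm", "serve", MODEL_NAME,
    "--dtype", "bfloat16",          # required by MXFP4 quantization
    "--max-model-len", str(MAX_MODEL_LEN),
    "--disable-custom-all-reduce",  # single-GPU deployment
    "--enable-prefix-caching",
    "--load-format", "auto",
]
print(" ".join(cmd))
```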
## 🔧 Commands
| Command | Description |
| --------------------------------------- | ---------------------------- |
| `modal run gpt_oss_inference.py` | Test with a temporary server |
| `modal deploy gpt_oss_inference.py` | Deploy to production |
| `modal app stop gpt-oss-vllm-inference` | Stop the deployed app |
| `modal app logs gpt-oss-vllm-inference` | View deployment logs |
| `modal volume ls` | List cached volumes |
## 🌐 API Usage
Once deployed, the server exposes an OpenAI-compatible API:
### Endpoint URL
After deployment, Modal will provide a URL like:
```
https://your-workspace--gpt-oss-vllm-inference-serve.modal.run
```
### Making Requests
```python
import openai
client = openai.OpenAI(
base_url="https://your-workspace--gpt-oss-vllm-inference-serve.modal.run/v1",
api_key="not-needed" # Modal handles auth via the URL
)
response = client.chat.completions.create(
model="llm",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
]
)
print(response.choices[0].message.content)
```
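Streaming works through the same endpoint via the standard OpenAI client interface; continuing from the `client` created above:
```python
# Streaming variant; reuses the `client` configured above.
stream = client.chat.completions.create(
    model="llm",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```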
### cURL Example
```bash
curl -X POST "https://your-workspace--gpt-oss-vllm-inference-serve.modal.run/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "llm",
"messages": [
{"role": "user", "content": "Hello!"}
]
}'
```
## 💰 Pricing
Modal charges per second of usage:
- **A10G GPU**: ~$0.76/hour (recommended) ✅
- **L4 GPU**: ~$0.59/hour (cheapest)
- **A100 40GB**: ~$1.79/hour
- **H100 GPU**: ~$3.95/hour (fastest)
- No charges when idle (scale to zero)
- First $30/month is free
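Because billing is per second, cost scales with active container time; a rough back-of-envelope:
```python
# Back-of-envelope cost estimate (rates above are approximate).
A10G_RATE_PER_HOUR = 0.76
active_seconds = 45 * 60  # e.g., 45 minutes of active container time

cost = A10G_RATE_PER_HOUR / 3600 * active_seconds
print(f"~${cost:.2f} for {active_seconds // 60} active minutes on an A10G")
# -> ~$0.57 for 45 active minutes on an A10G
```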
## 📦 Model Details
### GPT-OSS 20B
- Mixture-of-Experts (MoE) architecture; only a subset of experts is active per token, keeping inference efficient
- MXFP4 quantization for MoE layers (~10-15GB VRAM)
- Attention sink support for longer contexts
- **Fits on A10G, L4, A100, or H100** ✅
### GPT-OSS 120B
- Larger MoE model with stronger capabilities
- Same quantization and architecture (~40-50GB VRAM)
- **Requires A100 80GB or H100**
## 🔍 Troubleshooting
### Authentication Issues
```bash
# Re-authenticate
modal token new
```
### GPU Availability
If your selected GPU is not available, Modal will queue your request. Tips:
- **A10G and L4** typically have better availability than H100
- Try different regions
- Use off-peak hours
- Change `GPU_CONFIG` to a different tier, or pass several tiers as fallbacks (see the sketch below)
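Modal also accepts a list of GPU types so the scheduler can fall back across tiers; a minimal sketch (confirm list support in your installed Modal client version):
```python
import modal

app = modal.App("gpu-fallback-demo")  # illustrative name

# Modal tries each GPU type in order until one is schedulable
# (confirm list support in your installed Modal client version).
@app.function(gpu=["A10G", "L4", "A100-40GB"])
def probe():
    # Print whichever GPU tier was available first.
    import subprocess
    subprocess.run(["nvidia-smi", "-L"], check=True)
```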
### Marlin Kernel Warning
If you see: `You are running Marlin kernel with bf16 on GPUs before SM90`:
- **This warning can be safely ignored** - GPT-OSS uses MXFP4 quantization which **requires bfloat16**
- float16 is NOT supported for MXFP4 quantization (will cause a validation error)
- The warning is just a performance suggestion, but we cannot use fp16 for this model
- For optimal performance, use H100 (SM90+) which is optimized for bf16
### Startup Time Optimization
If startup takes ~1 minute:
- ✅ **Already optimized** - The code includes several optimizations:
- Uses `bfloat16`, the only dtype supported by MXFP4 quantization
- Limits context window to 32k tokens (faster memory allocation)
- Disables custom all-reduce for single GPU
- Enables prefix caching
- Uses auto load format detection
- To reduce startup further, you can:
- Increase `SCALEDOWN_WINDOW` to keep container warm longer (costs more)
- Use a larger GPU (A100/H100) for faster model loading
- Reduce `MAX_MODEL_LEN` if you don't need full context window
### Cache Issues
```bash
# Clear vLLM cache
modal volume rm vllm-cache
modal volume create vllm-cache
# Clear HuggingFace cache
modal volume rm huggingface-cache
modal volume create huggingface-cache
```
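For context, a sketch of how these named volumes are typically attached to the inference function (mount paths are assumptions; check `gpt_oss_inference.py` for the actual ones):
```python
import modal

# Handles to the named volumes created above.
vllm_cache = modal.Volume.from_name("vllm-cache", create_if_missing=True)
hf_cache = modal.Volume.from_name("huggingface-cache", create_if_missing=True)

app = modal.App("cache-demo")  # illustrative name

# Mount paths are assumptions; check gpt_oss_inference.py for the
# paths the deployment actually uses.
@app.function(volumes={
    "/root/.cache/vllm": vllm_cache,
    "/root/.cache/huggingface": hf_cache,
})
def inspect_caches():
    # List the contents of each cache to confirm what is persisted.
    import os
    for path in ("/root/.cache/vllm", "/root/.cache/huggingface"):
        print(path, os.listdir(path))
```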
## 📚 Resources
- [Modal Documentation](https://modal.com/docs/guide)
- [vLLM Documentation](https://docs.vllm.ai/)
- [GPT-OSS on HuggingFace](https://huggingface.co/openai/gpt-oss-20b)
- [Modal Examples](https://modal.com/docs/examples)