
Modal Deployment for GPT-OSS vLLM

Deploy OpenAI's GPT-OSS models (20B or 120B) on Modal.com with vLLM for efficient inference.

🚀 Quick Start

1. Install Modal CLI

# Install the Modal Python package
pip install modal

# Authenticate with Modal (opens browser)
modal setup

If modal setup doesn't work, try:

python -m modal setup

2. Create a Modal Account

  1. Go to modal.com
  2. Create a free account
  3. Run modal setup to authenticate

3. Deploy the GPT-OSS Model

# Navigate to the modal directory
cd modal

# Test the server (spins up a temporary instance)
modal run gpt_oss_inference.py

# Deploy to production (creates a persistent endpoint)
modal deploy gpt_oss_inference.py
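
A quick way to confirm the deployment is reachable is to list the served models over the OpenAI-compatible API. The URL below is a placeholder; substitute the one Modal prints after modal deploy, and allow a generous timeout since the first request may hit a cold start:

import requests

# Placeholder URL - replace with the endpoint Modal prints after `modal deploy`
BASE_URL = "https://your-workspace--gpt-oss-vllm-inference-serve.modal.run"

# vLLM's OpenAI-compatible server lists its served models at /v1/models
resp = requests.get(f"{BASE_URL}/v1/models", timeout=300)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])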

📋 Configuration

GPU Selection (Cost Optimization)

Edit gpt_oss_inference.py to choose your GPU tier:

# Choose your GPU - uncomment the one you want:
GPU_CONFIG = "A10G"  # ~$0.76/hr - RECOMMENDED for budget ✅
# GPU_CONFIG = "L4"     # ~$0.59/hr - Cheapest option
# GPU_CONFIG = "A100"   # ~$1.79/hr - More headroom
# GPU_CONFIG = "H100"   # ~$3.95/hr - Maximum performance

GPU Pricing Comparison

GPU         VRAM   Price/hr   Best For
L4          24GB   ~$0.59     Cheapest option (VRAM may be tight)
A10G        24GB   ~$0.76     Best value for GPT-OSS 20B
A100 40GB   40GB   ~$1.79     More headroom
A100 80GB   80GB   ~$2.78     Fits both 20B and 120B
H100        80GB   ~$3.95     Maximum performance

Model Selection

# 20B model - faster, fits on A10G/L4
MODEL_NAME = "openai/gpt-oss-20b"

# 120B model - needs A100 80GB or H100
MODEL_NAME = "openai/gpt-oss-120b"

Performance Tuning

# FAST_BOOT = True  - Faster startup, less memory (use for smaller GPUs)
# FAST_BOOT = False - Slower startup, faster inference
FAST_BOOT = True

# Data type - GPT-OSS MXFP4 quantization REQUIRES bfloat16 (float16 not supported)
# The Marlin kernel warning on A10G/L4 is expected and can be ignored
USE_FLOAT16 = False  # Must be False for GPT-OSS (MXFP4 only supports bfloat16)

# Maximum model length (context window) - reduce to speed up startup
MAX_MODEL_LEN = 32768  # 32k tokens (can increase to 131072 if needed)

# How long an idle container stays warm before scaling down
# (increase to avoid cold starts, at the cost of extra idle time)
SCALEDOWN_WINDOW = 5 * MINUTES

# Maximum concurrent requests (reduce for smaller GPUs)
MAX_INPUTS = 50
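
For orientation, here is a minimal sketch of where these constants typically land in the Modal app definition. It is an approximation, not the actual contents of gpt_oss_inference.py, and it assumes a recent Modal SDK where scaledown_window, @modal.concurrent, and @modal.web_server are available:

import modal

MINUTES = 60  # Modal durations are expressed in seconds

GPU_CONFIG = "A10G"
SCALEDOWN_WINDOW = 5 * MINUTES
MAX_INPUTS = 50

# Image with vLLM installed; the real file likely pins a specific vLLM build
image = modal.Image.debian_slim(python_version="3.12").pip_install("vllm")

app = modal.App("gpt-oss-vllm-inference")

@app.function(
    image=image,
    gpu=GPU_CONFIG,                      # GPU tier chosen above
    scaledown_window=SCALEDOWN_WINDOW,   # how long an idle container stays warm
    timeout=20 * MINUTES,
)
@modal.concurrent(max_inputs=MAX_INPUTS)   # cap concurrent requests per container
@modal.web_server(port=8000, startup_timeout=10 * MINUTES)
def serve():
    # Launch the vLLM OpenAI-compatible server here (see the flag sketch below)
    ...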

Startup Time Optimization

The following optimizations are enabled by default to reduce the ~1 minute startup time:

  • --max-model-len 32768: Limits the context window to 32k tokens (faster startup; can be increased to 131072 if needed)
  • --disable-custom-all-reduce: Skips custom all-reduce, which is unnecessary on a single GPU (reduces startup overhead)
  • --enable-prefix-caching: Enables prefix caching for faster subsequent requests
  • --load-format auto: Auto-detects the best loading format for faster model loading
  • Reduced scaledown window: Idle containers scale down after 5 minutes instead of 10 (lower idle cost; raise the window to avoid cold starts)

Note: --dtype bfloat16 is required for GPT-OSS (MXFP4 quantization only supports bf16)
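
As a rough illustration of how these flags fit together (a sketch, not the exact code in gpt_oss_inference.py; the --served-model-name flag is an assumption based on the model="llm" used in the API examples below), the server launch amounts to something like:

import subprocess

MODEL_NAME = "openai/gpt-oss-20b"
MAX_MODEL_LEN = 32768

cmd = [
    "vllm", "serve", MODEL_NAME,
    "--host", "0.0.0.0",
    "--port", "8000",
    "--dtype", "bfloat16",                 # required by MXFP4 quantization
    "--max-model-len", str(MAX_MODEL_LEN),
    "--disable-custom-all-reduce",         # single-GPU deployment
    "--enable-prefix-caching",
    "--load-format", "auto",
    "--served-model-name", "llm",          # lets clients request model="llm"
]
subprocess.Popen(cmd)  # OpenAI-compatible API comes up on port 8000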

🔧 Commands

Command                                  Description
modal run gpt_oss_inference.py           Test with a temporary server
modal deploy gpt_oss_inference.py        Deploy to production
modal app stop gpt-oss-vllm-inference    Stop the deployed app
modal app logs gpt-oss-vllm-inference    View deployment logs
modal volume list                        List cached volumes

🌐 API Usage

Once deployed, the server exposes an OpenAI-compatible API:

Endpoint URL

After deployment, Modal will provide a URL like:

https://your-workspace--gpt-oss-vllm-inference-serve.modal.run

Making Requests

import openai

client = openai.OpenAI(
    base_url="https://your-workspace--gpt-oss-vllm-inference-serve.modal.run/v1",
    api_key="not-needed"  # Modal handles auth via the URL
)

response = client.chat.completions.create(
    model="llm",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]
)
print(response.choices[0].message.content)
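
vLLM's OpenAI-compatible server also supports streaming. Continuing from the client above, a minimal streaming sketch:

# Stream tokens as they are generated instead of waiting for the full reply
stream = client.chat.completions.create(
    model="llm",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()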

cURL Example

curl -X POST "https://your-workspace--gpt-oss-vllm-inference-serve.modal.run/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llm",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'

💰 Pricing

Modal charges per second of usage:

  • A10G GPU: ~$0.76/hour (recommended) ✅
  • L4 GPU: ~$0.59/hour (cheapest)
  • A100 40GB: ~$1.79/hour
  • H100 GPU: ~$3.95/hour (fastest)
  • No charges when idle (scale to zero)
  • First $30/month is free
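
Because billing is per second, short bursts cost very little. A rough back-of-the-envelope example using the approximate A10G rate above:

# Approximate cost of 90 seconds of A10G time (e.g. a cold start plus a few requests)
A10G_PER_HOUR = 0.76                 # USD, approximate rate from the list above
seconds_active = 90
cost = A10G_PER_HOUR / 3600 * seconds_active
print(f"~${cost:.3f}")               # roughly $0.019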

📦 Model Details

GPT-OSS 20B

  • Mixture-of-experts (MoE) architecture for efficient inference
  • MXFP4 quantization for MoE layers (~10-15GB VRAM)
  • Attention sink support for longer contexts
  • Fits on A10G, L4, A100, or H100

GPT-OSS 120B

  • Larger model with more capabilities
  • Same quantization and architecture (~40-50GB VRAM)
  • Requires A100 80GB or H100

🔍 Troubleshooting

Authentication Issues

# Re-authenticate
modal token new

GPU Availability

If your selected GPU is not available, Modal will queue your request. Tips:

  • A10G and L4 typically have better availability than H100
  • Try different regions
  • Use off-peak hours
  • Change GPU_CONFIG to a different tier (or list several fallback tiers, as sketched below)
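
Recent Modal SDK versions also accept a list of GPU types, tried in order, which helps when one tier is scarce. A hedged sketch (confirm your Modal version supports GPU fallback lists):

import modal

app = modal.App("gpt-oss-vllm-inference")

# Try an A10G first; fall back to an L4 if none are available
@app.function(gpu=["A10G", "L4"])
def serve():
    ...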

Marlin Kernel Warning

If you see the warning "You are running Marlin kernel with bf16 on GPUs before SM90":

  • This warning can be safely ignored - GPT-OSS uses MXFP4 quantization which requires bfloat16
  • float16 is NOT supported for MXFP4 quantization (will cause a validation error)
  • The warning is just a performance suggestion, but we cannot use fp16 for this model
  • For optimal performance, use H100 (SM90+) which is optimized for bf16

Startup Time Optimization

If startup takes ~1 minute:

  • Already optimized - The code includes several optimizations:
    • Uses bfloat16 (required by MXFP4 quantization; float16 is not supported)
    • Limits context window to 32k tokens (faster memory allocation)
    • Disables custom all-reduce for single GPU
    • Enables prefix caching
    • Uses auto load format detection
  • To reduce startup further, you can:
    • Increase SCALEDOWN_WINDOW to keep container warm longer (costs more)
    • Use a larger GPU (A100/H100) for faster model loading
    • Reduce MAX_MODEL_LEN if you don't need the full context window
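
If most of a cold start goes to downloading weights, a hypothetical one-off helper like the one below can pre-populate the huggingface-cache volume. The volume name and mount path are assumptions; match whatever gpt_oss_inference.py actually uses:

import modal

app = modal.App("gpt-oss-prefetch")
hf_cache = modal.Volume.from_name("huggingface-cache", create_if_missing=True)
image = modal.Image.debian_slim(python_version="3.12").pip_install("huggingface_hub")

@app.function(image=image, volumes={"/root/.cache/huggingface": hf_cache}, timeout=60 * 60)
def prefetch(model_name: str = "openai/gpt-oss-20b"):
    from huggingface_hub import snapshot_download
    snapshot_download(model_name)  # downloads into the mounted cache volume
    hf_cache.commit()              # persist the files to the volume

Run it once with modal run prefetch.py::prefetch before deploying, so the main app finds the weights already cached.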

Cache Issues

# Clear vLLM cache
modal volume delete vllm-cache
modal volume create vllm-cache

# Clear HuggingFace cache
modal volume delete huggingface-cache
modal volume create huggingface-cache

📚 Resources