---
title: VRAM Calculator
emoji: 🧮
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: mit
short_description: Calculate VRAM requirements for HuggingFace models
tags:
- vram
- gpu
- inference
- deployment
---

# 🧮 VRAM & Instance Type Calculator

Estimate GPU memory requirements for any HuggingFace model and get cloud instance recommendations.

## Features

- **Automatic model analysis**: Fetches parameter count, dtype, and architecture from HF Hub
- **KV cache estimation**: Calculates memory for different context lengths
- **GPU recommendations**: Shows which GPUs can run the model (RTX 3090 → H100; see the sketch below)
- **Cloud instance mapping**: Suggests AWS/GCP instance types with pricing
- **Quantization guidance**: Suggests INT8/INT4 options for large models
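
The recommendation step boils down to a table lookup. Here is a minimal sketch, assuming a hypothetical `GPU_SPECS` table and `recommend_gpus` helper: the VRAM capacities are each card's published size, but the code itself is illustrative, not the app's actual implementation:

```python
# Hypothetical GPU table for illustration; VRAM capacities (GiB) are the
# published figures for each card.
GPU_SPECS = {
    "RTX 3090": 24,
    "RTX 4090": 24,
    "A10G": 24,
    "L40S": 48,
    "A100 80GB": 80,
    "H100 80GB": 80,
}

def recommend_gpus(required_gib: float, headroom: float = 0.9) -> list[str]:
    """Return GPUs whose VRAM, minus a 10% safety margin, covers the estimate."""
    return [name for name, vram in GPU_SPECS.items() if vram * headroom >= required_gib]

print(recommend_gpus(18.0))  # all 24 GiB cards and up qualify
```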

## How it works

1. Fetches `safetensors` metadata for parameter count and dtype
2. Downloads `config.json` for architecture details (layers, hidden size, KV heads)
3. Calculates:
   - Model weights: `params × dtype_bytes`
   - KV cache: `2 × layers × batch × seq_len × kv_heads × head_dim × dtype_bytes`
   - Overhead: adds ~15% for activations
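
The sketch below strings these steps together. `get_safetensors_metadata` and `hf_hub_download` are real `huggingface_hub` helpers; the config field names assume a Llama-style `config.json`, and the whole function is an approximation under the formulas above, not the app's exact code:

```python
import json
from huggingface_hub import get_safetensors_metadata, hf_hub_download

DTYPE_BYTES = {"F32": 4, "F16": 2, "BF16": 2, "I8": 1}  # safetensors dtype names

def estimate_vram_gib(repo_id: str, batch: int = 1, seq_len: int = 4096) -> float:
    # 1. Parameter count and dominant dtype from safetensors metadata
    #    (no weight download needed).
    meta = get_safetensors_metadata(repo_id)
    params = sum(meta.parameter_count.values())
    dtype = max(meta.parameter_count, key=meta.parameter_count.get)
    dtype_bytes = DTYPE_BYTES.get(dtype, 2)

    # 2. Architecture details from config.json (Llama-style field names assumed).
    with open(hf_hub_download(repo_id, "config.json")) as f:
        cfg = json.load(f)
    layers = cfg["num_hidden_layers"]
    heads = cfg["num_attention_heads"]
    kv_heads = cfg.get("num_key_value_heads", heads)  # GQA models have fewer
    head_dim = cfg["hidden_size"] // heads

    # 3. Weights + KV cache (the factor 2 covers one K and one V tensor per
    #    layer), then ~15% overhead for activations.
    weights = params * dtype_bytes
    kv_cache = 2 * layers * batch * seq_len * kv_heads * head_dim * dtype_bytes
    return (weights + kv_cache) * 1.15 / 2**30

# Mistral-7B in BF16: ~13.5 GiB of weights plus ~0.5 GiB of KV cache at a
# 4k context (GQA keeps the cache small), before the 15% overhead.
print(f"{estimate_vram_gib('mistralai/Mistral-7B-v0.1'):.1f} GiB")
```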

## Limitations

- Estimates are for inference only, not training
- Actual VRAM varies by serving framework (vLLM vs. TGI vs. vanilla Transformers)
- GGUF/quantized models have different memory profiles
- Does not account for tensor parallelism across multiple GPUs

## Usage

Use the hosted Space directly, or run it locally:

```bash
pip install gradio huggingface_hub
python app.py
```

## Contributing

PRs welcome! Ideas for improvement:

- Add support for GGUF models
- Include throughput estimates
- Add more cloud providers (Azure, Lambda Labs)
- Support tensor parallelism calculations