---
license: apache-2.0
base_model: Kwaipilot/KAT-Dev-72B-Exp
pipeline_tag: text-generation
library_name: llama.cpp
language:
- en
tags:
- gguf
- quantized
- ollama
- coding
- llama-cpp
- text-generation
quantized_by: richardyoung
---
# 💻 KAT-Dev 72B - GGUF

### Enterprise-Grade 72B Coding Model, Optimized for Local Inference

[![GGUF](https://img.shields.io/badge/Format-GGUF-blue)](https://github.com/ggerganov/llama.cpp) [![Size](https://img.shields.io/badge/Variants-4_Quantizations-green)](https://huggingface.co/richardyoung/kat-dev-72b) [![Ollama](https://img.shields.io/badge/Runtime-Ollama-orange)](https://ollama.ai/) [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

**[Original Model](https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp)** | **[Ollama Registry](https://ollama.com/richardyoung/kat-dev-72b)** | **[llama.cpp](https://github.com/ggerganov/llama.cpp)**

---
## 📖 What is This?

This is **KAT-Dev 72B**, a powerful coding model with 72 billion parameters, quantized to **GGUF format** for efficient local inference. Perfect for developers who want enterprise-grade code assistance running entirely on their own hardware with Ollama or llama.cpp!

### ✨ Why You'll Love It

- 💻 **Coding-Focused** - Optimized specifically for programming tasks
- 🧠 **72B Parameters** - Large enough for complex reasoning and refactoring
- ⚡ **Local Inference** - Run entirely on your machine, no API calls
- 🔒 **Privacy First** - Your code never leaves your computer
- 🎯 **Multiple Quantizations** - Choose your speed/quality trade-off
- 🚀 **Ollama Ready** - One command to start coding
- 🔧 **llama.cpp Compatible** - Works with your favorite tools

## 🎯 Quick Start

### Option 1: Ollama (Easiest!)

Pull and run directly from the Ollama registry:

```bash
# Recommended: IQ3_M (best balance)
ollama run richardyoung/kat-dev-72b:iq3_m

# Other variants
ollama run richardyoung/kat-dev-72b:iq4_xs   # Better quality
ollama run richardyoung/kat-dev-72b:iq2_m    # Faster, smaller
ollama run richardyoung/kat-dev-72b:iq2_xxs  # Most compact
```

That's it! Start asking coding questions! 🎉

### Option 2: Build from Modelfile

Download this repo and build locally:

```bash
# Clone or download the modelfiles
ollama create kat-dev-72b-iq3_m -f modelfiles/kat-dev-72b--iq3_m.Modelfile
ollama run kat-dev-72b-iq3_m
```

### Option 3: llama.cpp

Use with llama.cpp directly:

```bash
# Download the GGUF file (replace variant as needed)
huggingface-cli download richardyoung/kat-dev-72b kat-dev-72b-iq3_m.gguf --local-dir ./

# Run with llama.cpp
./llama-cli -m kat-dev-72b-iq3_m.gguf -p "Write a Python function to"
```

## 💻 System Requirements

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| **RAM** | 32 GB | 64 GB+ |
| **Storage** | 40 GB free | 50+ GB free |
| **CPU** | Modern 8-core | 16+ cores |
| **GPU** | Optional (CPU-only works!) | Metal/CUDA for acceleration |
| **OS** | macOS, Linux, Windows | Latest versions |

> 💡 **Tip:** Larger quantizations (IQ4_XS) need more RAM but produce better code. Smaller ones (IQ2_XXS) are faster but less precise.

## 🎨 Available Quantizations

Choose the right balance for your needs:

| Quantization | Size | Quality | Speed | RAM Usage | Best For |
|--------------|------|---------|-------|-----------|----------|
| **IQ4_XS** | 37 GB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ~50 GB | Production code, complex refactoring |
| **IQ3_M** (recommended) | 33 GB | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ~40 GB | Daily development, best balance |
| **IQ2_M** | 27 GB | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ~35 GB | Quick prototyping, fast iteration |
| **IQ2_XXS** | 24 GB | ⭐⭐ | ⭐⭐⭐⭐⭐ | ~30 GB | Testing, very constrained systems |

### Variant Details

| Variant | Size | Blob SHA256 |
|---------|------|-------------|
| `iq4_xs` | 36.98 GB | `c4cb9c6e...` |
| `iq3_m` | 33.07 GB | `14d07184...` |
| `iq2_m` | 27.32 GB | `cbe26a3c...` |
| `iq2_xxs` | 23.74 GB | `a49c7526...` |

## 📚 Usage Examples

### Code Generation

```bash
ollama run richardyoung/kat-dev-72b:iq3_m "Write a Python function to validate email addresses with regex"
```

### Code Explanation

```bash
ollama run richardyoung/kat-dev-72b:iq3_m "Explain this code: def fib(n): return n if n < 2 else fib(n-1) + fib(n-2)"
```

### Debugging Help

```bash
ollama run richardyoung/kat-dev-72b:iq3_m "Why does this Python code raise a KeyError?"
```
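For real debugging you'll want the model to see the failing code and traceback. One way is to pipe them in, since `ollama run` treats piped stdin as part of the prompt. A minimal sketch (`buggy.py` and `traceback.txt` are hypothetical file names):

```bash
# Pipe the failing script and its traceback into the prompt
cat buggy.py traceback.txt | ollama run richardyoung/kat-dev-72b:iq3_m \
  "Why does this Python code raise a KeyError? Code and traceback follow:"
```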
### Refactoring

```bash
ollama run richardyoung/kat-dev-72b:iq3_m "Refactor this JavaScript function to use async/await instead of callbacks"
```

### Multi-turn Conversation

```bash
ollama run richardyoung/kat-dev-72b:iq3_m
>>> I need to build a REST API in Python
>>> Show me a FastAPI example with authentication
>>> How do I add rate limiting?
```
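Inside an interactive session, Ollama also accepts slash commands for managing the conversation (type `/?` in the session for the full list):

```bash
>>> /clear   # reset the conversation context
>>> /bye     # exit the session
```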
## 🏗️ Model Details

### Architecture

- **Base Model:** KAT-Dev 72B Exp by Kwaipilot
- **Parameters:** ~72 Billion
- **Quantization:** GGUF format (IQ2_XXS to IQ4_XS)
- **Context Length:** Standard (check the base model for specifics, or inspect a downloaded variant as shown below)
- **Optimization:** Code generation and understanding
- **Training:** Specialized for programming tasks

### Supported Languages

The model excels at:

- Python
- JavaScript/TypeScript
- Java
- C/C++
- Go
- Rust
- And many more!
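To see the exact context length and other metadata baked into a variant, inspect it locally with `ollama show` (output fields vary by Ollama version):

```bash
# Print model metadata: architecture, parameter count, context length, quantization
ollama show richardyoung/kat-dev-72b:iq3_m

# Print the underlying Modelfile, including the chat template and any baked-in parameters
ollama show richardyoung/kat-dev-72b:iq3_m --modelfile
```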
## ⚡ Performance Tips
Tips for getting the best results:

1. **Choose the right quantization** - IQ3_M is recommended for daily use
2. **Use specific prompts** - "Write a Python function to X" works better than "code for X"
3. **Provide context** - Share error messages, file structures, or requirements
4. **Iterate** - Ask follow-up questions to refine the code
5. **GPU acceleration** - Use Metal (Mac) or CUDA (NVIDIA) for faster inference
6. **Temperature settings** - Lower (0.1-0.3) for precise code, higher (0.7-0.9) for creative solutions

### Example Ollama Configuration

Edit the Modelfile to add custom parameters:

```dockerfile
PARAMETER temperature 0.2
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
```

Then rebuild:

```bash
# Create with custom parameters
ollama create my-kat-dev -f modelfiles/kat-dev-72b--iq3_m.Modelfile
```
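If you'd rather experiment before baking parameters into a Modelfile, the same settings can be changed per session from the interactive prompt (values here mirror the Modelfile example above):

```bash
ollama run richardyoung/kat-dev-72b:iq3_m
>>> /set parameter temperature 0.2
>>> /set parameter top_p 0.9
>>> Write a Python function to merge two sorted lists
```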
## 🔧 Building Custom Variants

You can modify the included Modelfiles to customize behavior:

```dockerfile
FROM ./kat-dev-72b-iq3_m.gguf

# System prompt
SYSTEM You are an expert programmer specializing in Python and web development.

# Parameters
PARAMETER temperature 0.2
PARAMETER num_ctx 8192
PARAMETER stop "<|endoftext|>"
```

Then build:

```bash
ollama create my-custom-kat -f custom.Modelfile
```

## ⚠️ Known Limitations

- 💾 **Large Size** - Even the smallest variant needs 24+ GB of storage
- 🐏 **RAM Intensive** - Requires significant system memory
- ⏱️ **Inference Speed** - Slower than smaller models (trade-off for quality)
- 🌐 **English-Focused** - Best performance with English prompts
- 📝 **Code-Specialized** - Not optimized for general conversation

## 📄 License

Apache 2.0 - Same as the original model. Free for commercial use!

## 🙏 Acknowledgments

- **Original Model:** [Kwaipilot](https://huggingface.co/Kwaipilot) for creating KAT-Dev 72B
- **GGUF Format:** [Georgi Gerganov](https://github.com/ggerganov) for llama.cpp
- **Ollama:** [Ollama team](https://ollama.ai/) for the amazing runtime
- **Community:** All the developers testing and providing feedback

## 🔗 Useful Links

- 📦 **Original Model:** [Kwaipilot/KAT-Dev-72B-Exp](https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp)
- 🚀 **Ollama Registry:** [richardyoung/kat-dev-72b](https://ollama.com/richardyoung/kat-dev-72b)
- 🛠️ **llama.cpp:** [GitHub](https://github.com/ggerganov/llama.cpp)
- 📖 **Ollama Docs:** [Documentation](https://github.com/ollama/ollama)
- 💬 **Discussions:** [Ask questions here!](https://huggingface.co/richardyoung/kat-dev-72b/discussions)

## 🎮 Pro Tips
Advanced usage patterns:

### 1. Integration with VS Code

Use with Continue.dev or other coding assistants:

```json
{
  "models": [
    {
      "title": "KAT-Dev 72B",
      "provider": "ollama",
      "model": "richardyoung/kat-dev-72b:iq3_m"
    }
  ]
}
```

### 2. API Server Mode

Run as an OpenAI-compatible API (see tip 4 below for example requests):

```bash
ollama serve
# Then use the API at http://localhost:11434
```

### 3. Batch Processing

Process multiple files:

```bash
for file in *.py; do
  ollama run richardyoung/kat-dev-72b:iq3_m "Review this code: $(cat "$file")" > "${file}.review"
done
```
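### 4. Example API Requests

Once `ollama serve` is running (tip 2), you can hit both Ollama's native endpoint and its OpenAI-compatible `/v1` endpoint. A minimal sketch with curl; adjust the prompt and model tag as needed:

```bash
# Native Ollama API
curl http://localhost:11434/api/generate -d '{
  "model": "richardyoung/kat-dev-72b:iq3_m",
  "prompt": "Write a binary search function in Python",
  "stream": false
}'

# OpenAI-compatible chat endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "richardyoung/kat-dev-72b:iq3_m",
    "messages": [{"role": "user", "content": "Write a binary search function in Python"}]
  }'
```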
---
**Quantized with ❤️ by [richardyoung](https://deepneuro.ai/richard)**

*If you find this useful, please ⭐ star the repo and share with other developers!*

**Format:** GGUF | **Runtime:** Ollama / llama.cpp | **Created:** October 2025