---
license: apache-2.0
tags:
  - gguf
  - qwen
  - llama.cpp
  - quantized
  - text-generation
  - reasoning
  - agent
  - chat
  - multilingual
base_model: Qwen/Qwen3-8B
author: geoffmunn
pipeline_tag: text-generation
language:
  - en
  - zh
  - es
  - fr
  - de
  - ru
  - ar
  - ja
  - ko
  - hi
---

# Qwen3-8B-GGUF

This is a GGUF-quantized version of the Qwen/Qwen3-8B language model, an 8-billion-parameter LLM from Alibaba's Qwen series designed for advanced reasoning, agentic behavior, and multilingual tasks.

Converted for use with `llama.cpp` and compatible tools like OpenWebUI, LM Studio, GPT4All, and more.

## 💡 Key Features of Qwen3-8B

- 🤔 **Thinking Mode**: Use `enable_thinking=True` or `/think` for step-by-step logic, math, and code (see the CLI sketch below).
- ⚡ **Non-Thinking Mode**: Use `/no_think` for fast, lightweight dialogue.
- 🧰 **Agent Capable**: Integrates with tools via MCP, APIs, and plugins.
- 🌍 **Multilingual Support**: Fluent in 100+ languages, including Chinese, English, Spanish, Arabic, and Japanese.
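
As a quick illustration of the two modes, here is a minimal `llama.cpp` CLI sketch. The filename assumes the Q5_K_M variant, and the sampling values follow Qwen's suggested per-mode settings; both are assumptions to adapt to your setup.

```bash
# Minimal sketch (not an official invocation): toggle Qwen3's thinking mode
# via the /think and /no_think soft switches appended to the prompt.

# Thinking mode: step-by-step reasoning before the final answer.
./llama-cli -m Qwen3-8B-Q5_K_M.gguf \
  -p "Solve 17 * 24 step by step. /think" \
  -n 512 --temp 0.6 --top-p 0.95

# Non-thinking mode: fast, direct replies.
./llama-cli -m Qwen3-8B-Q5_K_M.gguf \
  -p "Summarize GGUF in one sentence. /no_think" \
  -n 128 --temp 0.7 --top-p 0.8
```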

## Available Quantizations (from f16)

These variants were built from an f16 base model to ensure consistency across quantization levels.

| Level  | Quality          | Speed      | Size   | Recommendation |
|--------|------------------|------------|--------|----------------|
| Q2_K   | Very Low         | ⚡ Fastest | 2.7 GB | Only on severely memory-constrained systems (<6 GB RAM). Avoid for reasoning. |
| Q3_K_S | Low              | ⚡ Fast    | 3.1 GB | Minimal viability; basic completion only. Not recommended. |
| Q3_K_M | Low-Medium       | ⚡ Fast    | 3.3 GB | Acceptable for simple chat on older systems. No complex logic. |
| Q4_K_S | Medium           | 🚀 Fast    | 3.8 GB | Good balance for low-end laptops or embedded platforms. |
| Q4_K_M | ✅ Balanced      | 🚀 Fast    | 4.0 GB | Best overall for general use on average hardware. Great speed/quality trade-off. |
| Q5_K_S | High             | 🐢 Medium  | 4.5 GB | Better reasoning; slightly faster than Q5_K_M. Ideal for coding. |
| Q5_K_M | ✅✅ High        | 🐢 Medium  | 4.6 GB | Top pick for deep interactions, logic, and tool use. Recommended for desktops. |
| Q6_K   | 🔥 Near-FP16     | 🐌 Slow    | 5.2 GB | Excellent fidelity; ideal for RAG, retrieval, and accuracy-critical tasks. |
| Q8_0   | 🏆 Near-lossless | 🐌 Slow    | 6.8 GB | Maximum accuracy; best for research, benchmarking, or archival. |
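
To fetch a single variant instead of cloning the whole repository, the Hugging Face CLI works well. A sketch, assuming the repo id `geoffmunn/Qwen3-8B` and this card's filename scheme (both assumptions; substitute the quant you chose above):

```bash
# Sketch: download one quant level with the Hugging Face CLI.
# Repo id and filename are assumed from this card's naming scheme.
pip install -U huggingface_hub
huggingface-cli download geoffmunn/Qwen3-8B \
  Qwen3-8B-Q5_K_M.gguf --local-dir ./models
```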

## 💡 Recommendations by Use Case

- 💻 **Low-end CPU / Old Laptop**: Q4_K_M (best balance under pressure)
- 🖥️ **Standard/Mid-tier Laptop (i5/i7/M1/M2)**: Q5_K_M (optimal quality)
- 🧠 **Reasoning, Coding, Math**: Q5_K_M or Q6_K (use thinking mode!)
- 🤖 **Agent & Tool Integration**: Q5_K_M; handles JSON and function calls well
- 🔍 **RAG, Retrieval, Precision Tasks**: Q6_K or Q8_0
- 📦 **Storage-Constrained Devices**: Q4_K_S or Q4_K_M
- 🛠️ **Development & Testing**: Test from Q4_K_M up to Q8_0 to assess trade-offs (see the loop sketch below)
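
For that development-and-testing workflow, a small loop makes the comparison painless. This is a hypothetical harness that assumes all the quant files sit in the current directory under this card's naming scheme:

```bash
# Hypothetical harness: run one prompt across several quant levels
# and compare output quality by eye. Filenames are assumptions.
for q in Q4_K_M Q5_K_M Q6_K Q8_0; do
  echo "=== ${q} ==="
  ./llama-cli -m "Qwen3-8B-${q}.gguf" \
    -p "Explain the difference between RAM and VRAM. /no_think" \
    -n 160 --temp 0.7 --top-p 0.8 2>/dev/null
done
```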

## Usage

Load this model using:

- **OpenWebUI** – self-hosted AI interface with RAG & tools
- **LM Studio** – desktop app with GPU support and chat templates
- **GPT4All** – private, local AI chatbot (offline-first)
- Or directly via `llama.cpp`
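
For the `llama.cpp` route, a common pattern is to expose the model through llama.cpp's OpenAI-compatible server and point OpenWebUI (or any other client) at it. A minimal sketch, again assuming the Q5_K_M filename; context size and port are illustrative:

```bash
# Sketch: serve the model over an OpenAI-compatible HTTP API.
# Context size, host, and port are illustrative; tune for your hardware.
./llama-server -m Qwen3-8B-Q5_K_M.gguf \
  -c 8192 --host 127.0.0.1 --port 8080
# Clients can then target http://127.0.0.1:8080/v1
```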

Each quantized model includes its own README.md and shares a common MODELFILE for optimal configuration (see the Ollama sketch below).
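
If you run models through Ollama, that MODELFILE can be registered directly. A sketch, assuming the MODELFILE sits next to the downloaded GGUF; the tag `qwen3-8b-local` is just an example name:

```bash
# Sketch: register the shared MODELFILE with Ollama and start chatting.
# 'qwen3-8b-local' is an arbitrary local tag, not an official model name.
ollama create qwen3-8b-local -f MODELFILE
ollama run qwen3-8b-local "Why is the sky blue? /no_think"
```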

## Author

👤 Geoff Munn ([@geoffmunn](https://huggingface.co/geoffmunn))

## Disclaimer

This is a community conversion for local inference. Not affiliated with Alibaba Cloud or the Qwen team.