---
license: apache-2.0
tags:
  - gguf
  - qwen
  - llama.cpp
  - quantized
  - text-generation
  - reasoning
  - agent
  - chat
  - multilingual
base_model: Qwen/Qwen3-8B
author: geoffmunn
pipeline_tag: text-generation
language:
  - en
  - zh
  - es
  - fr
  - de
  - ru
  - ar
  - ja
  - ko
  - hi
---

# Qwen3-8B-GGUF

This is a GGUF-quantized version of the Qwen/Qwen3-8B language model, an 8-billion-parameter LLM from Alibaba's Qwen series designed for advanced reasoning, agentic behavior, and multilingual tasks.

Converted for use with `llama.cpp` and compatible tools like OpenWebUI, LM Studio, GPT4All, and more.

## 💡 Key Features of Qwen3-8B

- 🤔 **Thinking Mode**: Use `enable_thinking=True` or `/think` for step-by-step logic, math, and code (see the CLI sketch below).
- ⚡ **Non-Thinking Mode**: Use `/no_think` for fast, lightweight dialogue.
- 🧰 **Agent Capable**: Integrates with tools via MCP, APIs, and plugins.
- 🌍 **Multilingual Support**: Fluent in 100+ languages, including Chinese, English, Spanish, Arabic, and Japanese.
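
As a quick illustration of the two modes, here is a minimal `llama.cpp` CLI sketch. The filename assumes the Q5_K_M variant, and the sampling values follow Qwen's suggested per-mode settings; both are assumptions to adapt to your setup.

```bash
# Minimal sketch (not an official invocation): toggle Qwen3's thinking mode
# via the /think and /no_think soft switches appended to the prompt.

# Thinking mode: step-by-step reasoning before the final answer.
./llama-cli -m Qwen3-8B-Q5_K_M.gguf \
  -p "Solve 17 * 24 step by step. /think" \
  -n 512 --temp 0.6 --top-p 0.95

# Non-thinking mode: fast, direct replies.
./llama-cli -m Qwen3-8B-Q5_K_M.gguf \
  -p "Summarize GGUF in one sentence. /no_think" \
  -n 128 --temp 0.7 --top-p 0.8
```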

## Available Quantizations (from f16)

These variants were built from an f16 base model to ensure consistency across quantization levels.

| Level  | Quality          | Speed      | Size   | Recommendation |
|--------|------------------|------------|--------|----------------|
| Q2_K   | Very Low         | ⚡ Fastest | 2.7 GB | Only on severely memory-constrained systems (<6 GB RAM). Avoid for reasoning. |
| Q3_K_S | Low              | ⚡ Fast    | 3.1 GB | Minimal viability; basic completion only. Not recommended. |
| Q3_K_M | Low-Medium       | ⚡ Fast    | 3.3 GB | Acceptable for simple chat on older systems. No complex logic. |
| Q4_K_S | Medium           | 🚀 Fast    | 3.8 GB | Good balance for low-end laptops or embedded platforms. |
| Q4_K_M | ✅ Balanced      | 🚀 Fast    | 4.0 GB | Best overall for general use on average hardware. Great speed/quality trade-off. |
| Q5_K_S | High             | 🐢 Medium  | 4.5 GB | Better reasoning; slightly faster than Q5_K_M. Ideal for coding. |
| Q5_K_M | ✅✅ High        | 🐢 Medium  | 4.6 GB | Top pick for deep interactions, logic, and tool use. Recommended for desktops. |
| Q6_K   | 🔥 Near-FP16     | 🐌 Slow    | 5.2 GB | Excellent fidelity; ideal for RAG, retrieval, and accuracy-critical tasks. |
| Q8_0   | 🏆 Near-lossless | 🐌 Slow    | 6.8 GB | Maximum accuracy; best for research, benchmarking, or archival. |
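
To fetch a single variant instead of cloning the whole repository, the Hugging Face CLI works well. A sketch, assuming the repo id `geoffmunn/Qwen3-8B` and this card's filename scheme (both assumptions; substitute the quant you chose above):

```bash
# Sketch: download one quant level with the Hugging Face CLI.
# Repo id and filename are assumed from this card's naming scheme.
pip install -U huggingface_hub
huggingface-cli download geoffmunn/Qwen3-8B \
  Qwen3-8B-Q5_K_M.gguf --local-dir ./models
```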

## 💡 Recommendations by Use Case

- 💻 **Low-end CPU / Old Laptop**: Q4_K_M (best balance under pressure)
- 🖥️ **Standard/Mid-tier Laptop (i5/i7/M1/M2)**: Q5_K_M (optimal quality)
- 🧠 **Reasoning, Coding, Math**: Q5_K_M or Q6_K (use thinking mode!)
- 🤖 **Agent & Tool Integration**: Q5_K_M; handles JSON and function calls well
- 🔍 **RAG, Retrieval, Precision Tasks**: Q6_K or Q8_0
- 📦 **Storage-Constrained Devices**: Q4_K_S or Q4_K_M
- 🛠️ **Development & Testing**: Test from Q4_K_M up to Q8_0 to assess trade-offs (see the loop sketch below)
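
For that development-and-testing workflow, a small loop makes the comparison painless. This is a hypothetical harness that assumes all the quant files sit in the current directory under this card's naming scheme:

```bash
# Hypothetical harness: run one prompt across several quant levels
# and compare output quality by eye. Filenames are assumptions.
for q in Q4_K_M Q5_K_M Q6_K Q8_0; do
  echo "=== ${q} ==="
  ./llama-cli -m "Qwen3-8B-${q}.gguf" \
    -p "Explain the difference between RAM and VRAM. /no_think" \
    -n 160 --temp 0.7 --top-p 0.8 2>/dev/null
done
```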

## Usage

Load this model using:

- **OpenWebUI** – self-hosted AI interface with RAG & tools
- **LM Studio** – desktop app with GPU support and chat templates
- **GPT4All** – private, local AI chatbot (offline-first)
- Or directly via `llama.cpp`
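
For the `llama.cpp` route, a common pattern is to expose the model through llama.cpp's OpenAI-compatible server and point OpenWebUI (or any other client) at it. A minimal sketch, again assuming the Q5_K_M filename; context size and port are illustrative:

```bash
# Sketch: serve the model over an OpenAI-compatible HTTP API.
# Context size, host, and port are illustrative; tune for your hardware.
./llama-server -m Qwen3-8B-Q5_K_M.gguf \
  -c 8192 --host 127.0.0.1 --port 8080
# Clients can then target http://127.0.0.1:8080/v1
```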

Each quantized model includes its own README.md and shares a common MODELFILE for optimal configuration (see the Ollama sketch below).
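
If you run models through Ollama, that MODELFILE can be registered directly. A sketch, assuming the MODELFILE sits next to the downloaded GGUF; the tag `qwen3-8b-local` is just an example name:

```bash
# Sketch: register the shared MODELFILE with Ollama and start chatting.
# 'qwen3-8b-local' is an arbitrary local tag, not an official model name.
ollama create qwen3-8b-local -f MODELFILE
ollama run qwen3-8b-local "Why is the sky blue? /no_think"
```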

## Author

👤 Geoff Munn ([@geoffmunn](https://huggingface.co/geoffmunn))

## Disclaimer

This is a community conversion for local inference. Not affiliated with Alibaba Cloud or the Qwen team.