---
license: apache-2.0
tags:
- gguf
- qwen
- llama.cpp
- quantized
- text-generation
- reasoning
- agent
- chat
- multilingual
base_model: Qwen/Qwen3-8B
author: geoffmunn
---

# Qwen3-8B-Q8_0

Quantized version of Qwen/Qwen3-8B at the Q8_0 level, derived from f16 base weights.

## Model Info
- Format: GGUF (for llama.cpp and compatible runtimes)
- Size: 8.71 GB
- Precision: Q8_0
- Base Model: Qwen/Qwen3-8B
- Conversion Tool: llama.cpp
## Quality & Performance

| Metric | Value |
|---|---|
| Quality | Near-lossless |
| Speed | Slow |
| RAM Required | ~7.1 GB |
| Recommendation | Highest quality short of FP16; ideal for accuracy-critical tasks and benchmarks. |
## Prompt Template (ChatML)

This model uses the ChatML format used by Qwen:

```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```

Set this in your app (LM Studio, OpenWebUI, etc.) for best results.
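If your runtime does not apply a chat template for you (for example, a raw completion endpoint), you can build the ChatML prompt string yourself. A minimal Python sketch; the helper name is illustrative, not part of any API:

```python
# Build a ChatML prompt manually for runtimes that do not apply the chat
# template for you (helper name is illustrative only).
def chatml_prompt(user_msg: str, system_msg: str = "You are a helpful assistant.") -> str:
    return (
        f"<|im_start|>system\n{system_msg}<|im_end|>\n"
        f"<|im_start|>user\n{user_msg}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

print(chatml_prompt("Why is the sky blue?"))
```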
## Generation Parameters

### Thinking Mode (Recommended for Logic)
Use when solving math, coding, or logical problems.
| Parameter | Value |
|---|---|
| Temperature | 0.6 |
| Top-P | 0.95 |
| Top-K | 20 |
| Min-P | 0.0 |
| Repeat Penalty | 1.1 |
⚠️ DO NOT use greedy decoding: it causes infinite loops.
Enable via:

- `enable_thinking=True` in the tokenizer, or
- adding `/think` in the user input during the conversation
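As an illustration, here is a minimal sketch (assuming a local Ollama server and the Python `requests` package) that passes these thinking-mode settings through Ollama's `options` field and uses the `/think` soft switch:

```python
import requests

# Thinking-mode sampling settings from the table above; the /think soft switch
# is appended to the prompt so the model reasons before answering.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "hf.co/geoffmunn/Qwen3-8B:Q8_0",
        "prompt": "Solve step by step: what is 17 * 24? /think",
        "stream": False,
        "options": {
            "temperature": 0.6,
            "top_p": 0.95,
            "top_k": 20,
            "min_p": 0.0,
            "repeat_penalty": 1.1,
        },
    },
    timeout=600,
)
print(resp.json()["response"])
```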
### Non-Thinking Mode (Fast Dialogue)
For casual chat and quick replies.
| Parameter | Value |
|---|---|
| Temperature | 0.7 |
| Top-P | 0.8 |
| Top-K | 20 |
| Min-P | 0.0 |
| Repeat Penalty | 1.1 |
Enable via:

- `enable_thinking=False`, or
- adding `/no_think` in the prompt
Stop sequences: `<|im_end|>`, `<|im_start|>`
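Under the same assumptions as the sketch above (a local Ollama server, Python `requests`), the non-thinking settings and stop sequences can be bundled into an `options` dict and passed to `/api/generate` exactly as before:

```python
# Non-thinking (fast chat) settings for Ollama's "options" field, with the
# ChatML stop sequences set explicitly; use it like the previous sketch.
non_thinking_options = {
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0.0,
    "repeat_penalty": 1.1,
    "stop": ["<|im_end|>", "<|im_start|>"],
}
```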
## Usage Tips
This model supports two operational modes:
### Thinking Mode (Recommended for Logic)

Activate with `enable_thinking=True` or append `/think` to the prompt.
- Ideal for: math, coding, planning, analysis
- Use sampling: `temp=0.6`, `top_p=0.95`, `top_k=20`
- Avoid greedy decoding
### Non-Thinking Mode (Fast Chat)

Use `enable_thinking=False` or `/no_think`.

- Best for: casual conversation, quick answers
- Sampling: `temp=0.7`, `top_p=0.8`
### Switch Dynamically

In multi-turn chats, the last `/think` or `/no_think` directive takes precedence (see the sketch at the end of this section).

### Avoid Repetition

Set `presence_penalty=1.5` if the model gets stuck in loops.

### Use Full Context

Allow up to 32,768 output tokens for complex tasks.

### Agent Ready
Works with Qwen-Agent, MCP servers, and custom tools.
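For example, a minimal multi-turn sketch of the soft switch (assuming Ollama's `/api/chat` endpoint and the Python `requests` package); the second user turn appends `/no_think`, which overrides the earlier `/think`:

```python
import requests

URL = "http://localhost:11434/api/chat"
MODEL = "hf.co/geoffmunn/Qwen3-8B:Q8_0"

# Turn 1: thinking mode via the /think soft switch.
messages = [{"role": "user", "content": "Plan a 3-step study schedule for calculus. /think"}]
first = requests.post(URL, json={"model": MODEL, "messages": messages, "stream": False}, timeout=600).json()
messages.append(first["message"])

# Turn 2: the most recent directive wins, so /no_think switches back to fast replies.
messages.append({"role": "user", "content": "Now just list the step titles. /no_think"})
second = requests.post(URL, json={"model": MODEL, "messages": messages, "stream": False}, timeout=600).json()
print(second["message"]["content"])
```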
## CLI Example Using Ollama or TGI Server

Here's how you can query this model via API using `curl` and `jq`. Replace the endpoint (and adapt the payload) for your local server (e.g., Ollama, Text Generation Inference). Note that for Ollama, sampling parameters go inside the `options` object.
```bash
curl http://localhost:11434/api/generate -s -N -d '{
  "model": "hf.co/geoffmunn/Qwen3-8B:Q8_0",
  "prompt": "Repeat the following instruction exactly as given: Explain why the sky appears blue during the day but red at sunrise and sunset, using physics principles like Rayleigh scattering.",
  "stream": false,
  "options": {
    "temperature": 0.4,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "repeat_penalty": 1.1
  }
}' | jq -r '.response'
```
Why this works well:

- The prompt is meaningful and demonstrates reasoning, creativity, or clarity depending on the quant level.
- Temperature is tuned appropriately: lower (`0.4`) for factual responses, higher (`0.7`) for creative ones.
- `jq` extracts clean output from the JSON response.
Tip: For interactive streaming, set `"stream": true` and process the output line by line.
## Verification

Check integrity:

```bash
sha256sum -c ../SHA256SUMS.txt
```
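Equivalently, a small Python sketch; the filename and digest below are placeholders, copy the real digest from `SHA256SUMS.txt`:

```python
import hashlib

# Compare the file's SHA-256 digest against the entry in SHA256SUMS.txt.
# Filename and expected digest are placeholders.
expected = "<digest copied from SHA256SUMS.txt>"
h = hashlib.sha256()
with open("Qwen3-8B-Q8_0.gguf", "rb") as f:
    for block in iter(lambda: f.read(1 << 20), b""):
        h.update(block)
print("OK" if h.hexdigest() == expected else "MISMATCH")
```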
## Usage

Compatible with:

- LM Studio: local AI model runner with GPU acceleration
- OpenWebUI: self-hosted AI platform with RAG and tools
- GPT4All: private, offline AI chatbot
- Directly via `llama.cpp` (see the sketch below)
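If you prefer the Python bindings (`llama-cpp-python`; the model path below is a placeholder, and recent builds pick up the ChatML template from the GGUF metadata), a minimal sketch looks like this:

```python
from llama_cpp import Llama

# Load the local GGUF file (path is a placeholder) and ask one question.
llm = Llama(model_path="./Qwen3-8B-Q8_0.gguf", n_ctx=8192)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What does Q8_0 quantization mean? /no_think"},
    ],
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    repeat_penalty=1.1,
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```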
Supports dynamic switching between thinking modes via `/think` and `/no_think` in multi-turn conversations.
## License

Apache 2.0; see the base model for full terms.