|
|
---
license: apache-2.0
tags:
- gguf
- qwen
- llama.cpp
- quantized
- text-generation
- reasoning
- agent
- chat
- multilingual
base_model: Qwen/Qwen3-8B
author: geoffmunn
---
|
|
|
|
|
# Qwen3-8B-Q8_0
|
|
|
|
|
Quantized version of [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) at **Q8_0** level, derived from **f16** base weights.
|
|
|
|
|
## Model Info
|
|
|
|
|
- **Format**: GGUF (for llama.cpp and compatible runtimes)
- **Size**: 8.71 GB
- **Precision**: Q8_0
- **Base Model**: [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)
- **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
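If you use Ollama, you can pull and run this exact quant straight from Hugging Face; the tag below is the same one used in the API example later in this card (a minimal sketch, assuming a recent Ollama release that supports `hf.co/` model references):

```bash
# Pull and chat with this quant directly from Hugging Face via Ollama
ollama run hf.co/geoffmunn/Qwen3-8B:Q8_0
```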
|
|
|
|
|
## Quality & Performance
|
|
|
|
|
| Metric | Value |
|--------|-------|
| **Quality** | Near-lossless |
| **Speed** | Slow |
| **RAM Required** | ~7.1 GB |
| **Recommendation** | Highest quality without FP16; ideal for accuracy-critical tasks and benchmarks. |
|
|
|
|
|
## Prompt Template (ChatML)
|
|
|
|
|
This model uses the **ChatML** format used by Qwen:
|
|
|
|
|
```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```
|
|
|
|
|
Set this in your app (LM Studio, OpenWebUI, etc.) for best results.
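If you serve the GGUF with llama.cpp's `llama-server`, recent builds read the chat template embedded in the GGUF metadata and apply it on the OpenAI-compatible chat endpoint, so you only send plain messages. A minimal sketch (the file name and port are illustrative):

```bash
# Start an OpenAI-compatible server on port 8080 (file name is illustrative)
llama-server -m Qwen3-8B-Q8_0.gguf -c 8192 --port 8080

# Query it; the server converts messages to ChatML using the embedded template
curl http://localhost:8080/v1/chat/completions -s -d '{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain Rayleigh scattering in two sentences."}
  ],
  "temperature": 0.7,
  "top_p": 0.8
}' | jq -r '.choices[0].message.content'
```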
|
|
|
|
|
## Generation Parameters
|
|
|
|
|
### Thinking Mode (Recommended for Logic)
|
|
Use when solving math, coding, or logical problems.
|
|
|
|
|
| Parameter | Value |
|-----------|-------|
| Temperature | 0.6 |
| Top-P | 0.95 |
| Top-K | 20 |
| Min-P | 0.0 |
| Repeat Penalty | 1.1 |
|
|
|
|
|
> DO NOT use greedy decoding; it causes infinite repetition loops.
|
|
|
|
|
Enable via:

- `enable_thinking=True` in the tokenizer call
- Or add `/think` to the user input during conversation
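With Ollama, the `/think` directive can simply be appended to the prompt, and the thinking-mode sampling values are passed under `options`. A minimal sketch reusing the model tag from the API example further down:

```bash
# Thinking mode: append /think and use the recommended sampling values
curl http://localhost:11434/api/generate -s -d '{
  "model": "hf.co/geoffmunn/Qwen3-8B:Q8_0",
  "prompt": "How many positive divisors does 360 have? /think",
  "stream": false,
  "options": {
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "repeat_penalty": 1.1
  }
}' | jq -r '.response'
```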
|
|
|
|
|
### Non-Thinking Mode (Fast Dialogue)
|
|
For casual chat and quick replies.
|
|
|
|
|
| Parameter | Value |
|-----------|-------|
| Temperature | 0.7 |
| Top-P | 0.8 |
| Top-K | 20 |
| Min-P | 0.0 |
| Repeat Penalty | 1.1 |
|
|
|
|
|
Enable via:

- `enable_thinking=False`
- Or add `/no_think` to the prompt

Stop sequences: `<|im_end|>`, `<|im_start|>`
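The same pattern works for fast replies: append `/no_think` and pass the stop sequences explicitly if your runtime does not already set them. A minimal sketch (Ollama accepts `stop` inside `options`):

```bash
# Non-thinking mode: /no_think plus explicit ChatML stop sequences
curl http://localhost:11434/api/generate -s -d '{
  "model": "hf.co/geoffmunn/Qwen3-8B:Q8_0",
  "prompt": "Give me a one-line summary of what GGUF is. /no_think",
  "stream": false,
  "options": {
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "repeat_penalty": 1.1,
    "stop": ["<|im_end|>", "<|im_start|>"]
  }
}' | jq -r '.response'
```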
|
|
|
|
|
## Usage Tips
|
|
|
|
|
> This model supports two operational modes:
>
> ### Thinking Mode (Recommended for Logic)
> Activate with `enable_thinking=True` or append `/think` to the prompt.
>
> - Ideal for: math, coding, planning, analysis
> - Use sampling: `temp=0.6`, `top_p=0.95`, `top_k=20`
> - Avoid greedy decoding
>
> ### Non-Thinking Mode (Fast Chat)
> Use `enable_thinking=False` or `/no_think`.
>
> - Best for: casual conversation, quick answers
> - Sampling: `temp=0.7`, `top_p=0.8`
>
> ---
>
> **Switch Dynamically**
> In multi-turn chats, the last `/think` or `/no_think` directive takes precedence.
>
> **Avoid Repetition**
> Set `presence_penalty=1.5` if the output gets stuck in loops.
>
> **Use Full Context**
> Allow up to 32,768 output tokens for complex tasks.
>
> **Agent Ready**
> Works with Qwen-Agent, MCP servers, and custom tools.
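If you run the model with `llama-cli` directly, the repetition and context tips above map onto command-line flags. A rough sketch (flag names are taken from recent llama.cpp builds; the GGUF file name is illustrative):

```bash
# Large context plus a presence penalty to break out of repetition loops
llama-cli -m Qwen3-8B-Q8_0.gguf \
  -c 32768 -n 4096 \
  --temp 0.6 --top-p 0.95 --top-k 20 --repeat-penalty 1.1 \
  --presence-penalty 1.5 \
  -p "Plan a week-long workshop on light-scattering experiments."
```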
|
|
|
|
|
## CLI Example Using Ollama or TGI Server
|
|
|
|
|
Here's how you can query this model via API using `curl` and `jq`; with Ollama, the sampling parameters go inside the `options` object. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
|
|
|
|
|
```bash
curl http://localhost:11434/api/generate -s -N -d '{
  "model": "hf.co/geoffmunn/Qwen3-8B:Q8_0",
  "prompt": "Repeat the following instruction exactly as given: Explain why the sky appears blue during the day but red at sunrise and sunset, using physics principles like Rayleigh scattering.",
  "stream": false,
  "options": {
    "temperature": 0.4,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "repeat_penalty": 1.1
  }
}' | jq -r '.response'
```
|
|
|
|
|
**Why this works well**:

- The prompt is meaningful and demonstrates either **reasoning**, **creativity**, or **clarity** depending on quant level.
- Temperature is tuned appropriately: lower for factual responses (`0.4`), higher for creative ones (`0.7`).
- Uses `jq` to extract clean output.
|
|
|
|
|
> Tip: For interactive streaming, set `"stream": true` and process the output line by line.
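A minimal streaming sketch: with `"stream": true`, Ollama emits one JSON object per line, and `jq -j` can join the `response` fragments as they arrive (the model tag matches the example above):

```bash
# Stream tokens as they are generated; each line is a JSON object with a "response" fragment
curl http://localhost:11434/api/generate -s -N -d '{
  "model": "hf.co/geoffmunn/Qwen3-8B:Q8_0",
  "prompt": "Explain Rayleigh scattering in one paragraph. /no_think",
  "stream": true,
  "options": { "temperature": 0.7, "top_p": 0.8 }
}' | jq -j '.response'
echo
```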
|
|
|
|
|
## Verification
|
|
|
|
|
Check integrity:
|
|
|
|
|
```bash
sha256sum -c ../SHA256SUMS.txt
```
|
|
|
|
|
## Usage
|
|
|
|
|
Compatible with:

- [LM Studio](https://lmstudio.ai) – local AI model runner with GPU acceleration
- [OpenWebUI](https://openwebui.com) – self-hosted AI platform with RAG and tools
- [GPT4All](https://gpt4all.io) – private, offline AI chatbot
- Directly via `llama.cpp` (see the sketch below)
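For direct `llama.cpp` use, a one-liner is enough to sanity-check the file. A minimal sketch (the GGUF file name is illustrative; the sampling values are the non-thinking defaults from above):

```bash
# Quick completion-style smoke test with llama.cpp
llama-cli -m Qwen3-8B-Q8_0.gguf \
  -n 256 --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.1 \
  -p "Briefly explain what quantization does to an LLM."
```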
|
|
|
|
|
Supports dynamic switching between thinking and non-thinking modes via `/think` and `/no_think` in multi-turn conversations.
|
|
|
|
|
## License
|
|
|
|
|
Apache 2.0 – see the base model for full terms.
|
|
|