|
|
---
license: apache-2.0
tags:
- gguf
- qwen
- llama.cpp
- quantized
- text-generation
- reasoning
- agent
- chat
- multilingual
base_model: Qwen/Qwen3-8B
author: geoffmunn
---
|
|
|
|
|
# Qwen3-8B-Q8_0
|
|
|
|
|
Quantized version of [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) at **Q8_0** level, derived from **f16** base weights.
|
|
|
|
|
## Model Info
|
|
|
|
|
- **Format**: GGUF (for llama.cpp and compatible runtimes)
- **Size**: 8.71 GB
- **Precision**: Q8_0
- **Base Model**: [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)
- **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
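If you use Ollama, you can pull and run this exact quant straight from Hugging Face; the tag below is the same one used in the API example later in this card (a minimal sketch, assuming a recent Ollama release that supports `hf.co/` model references):

```bash
# Pull and chat with this quant directly from Hugging Face via Ollama
ollama run hf.co/geoffmunn/Qwen3-8B:Q8_0
```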
|
|
|
|
|
## Quality & Performance
|
|
|
|
|
| Metric | Value |
|--------|-------|
| **Quality** | Near-lossless |
| **Speed** | Slow |
| **RAM Required** | ~7.1 GB |
| **Recommendation** | Highest quality without FP16; ideal for accuracy-critical tasks and benchmarks. |
|
|
|
|
|
## Prompt Template (ChatML)
|
|
|
|
|
This model uses the **ChatML** format used by Qwen:
|
|
|
|
|
```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```
|
|
|
|
|
Set this in your app (LM Studio, OpenWebUI, etc.) for best results.
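If you serve the GGUF with llama.cpp's `llama-server`, recent builds read the chat template embedded in the GGUF metadata and apply it on the OpenAI-compatible chat endpoint, so you only send plain messages. A minimal sketch (the file name and port are illustrative):

```bash
# Start an OpenAI-compatible server on port 8080 (file name is illustrative)
llama-server -m Qwen3-8B-Q8_0.gguf -c 8192 --port 8080

# Query it; the server converts messages to ChatML using the embedded template
curl http://localhost:8080/v1/chat/completions -s -d '{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain Rayleigh scattering in two sentences."}
  ],
  "temperature": 0.7,
  "top_p": 0.8
}' | jq -r '.choices[0].message.content'
```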
|
|
|
|
|
## Generation Parameters
|
|
|
|
|
### Thinking Mode (Recommended for Logic)
|
|
Use when solving math, coding, or logical problems.
|
|
|
|
|
| Parameter | Value |
|-----------|-------|
| Temperature | 0.6 |
| Top-P | 0.95 |
| Top-K | 20 |
| Min-P | 0.0 |
| Repeat Penalty | 1.1 |
|
|
|
|
|
> DO NOT use greedy decoding; it causes infinite repetition loops.
|
|
|
|
|
Enable via:

- `enable_thinking=True` in the tokenizer call
- Or add `/think` to the user input during conversation
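With Ollama, the `/think` directive can simply be appended to the prompt, and the thinking-mode sampling values are passed under `options`. A minimal sketch reusing the model tag from the API example further down:

```bash
# Thinking mode: append /think and use the recommended sampling values
curl http://localhost:11434/api/generate -s -d '{
  "model": "hf.co/geoffmunn/Qwen3-8B:Q8_0",
  "prompt": "How many positive divisors does 360 have? /think",
  "stream": false,
  "options": {
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "repeat_penalty": 1.1
  }
}' | jq -r '.response'
```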
|
|
|
|
|
### Non-Thinking Mode (Fast Dialogue)
|
|
For casual chat and quick replies.
|
|
|
|
|
| Parameter | Value |
|-----------|-------|
| Temperature | 0.7 |
| Top-P | 0.8 |
| Top-K | 20 |
| Min-P | 0.0 |
| Repeat Penalty | 1.1 |
|
|
|
|
|
Enable via:

- `enable_thinking=False`
- Or add `/no_think` to the prompt

Stop sequences: `<|im_end|>`, `<|im_start|>`
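The same pattern works for fast replies: append `/no_think` and pass the stop sequences explicitly if your runtime does not already set them. A minimal sketch (Ollama accepts `stop` inside `options`):

```bash
# Non-thinking mode: /no_think plus explicit ChatML stop sequences
curl http://localhost:11434/api/generate -s -d '{
  "model": "hf.co/geoffmunn/Qwen3-8B:Q8_0",
  "prompt": "Give me a one-line summary of what GGUF is. /no_think",
  "stream": false,
  "options": {
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "repeat_penalty": 1.1,
    "stop": ["<|im_end|>", "<|im_start|>"]
  }
}' | jq -r '.response'
```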
|
|
|
|
|
## Usage Tips
|
|
|
|
|
> This model supports two operational modes:
>
> ### Thinking Mode (Recommended for Logic)
> Activate with `enable_thinking=True` or append `/think` to the prompt.
>
> - Ideal for: math, coding, planning, analysis
> - Use sampling: `temp=0.6`, `top_p=0.95`, `top_k=20`
> - Avoid greedy decoding
>
> ### Non-Thinking Mode (Fast Chat)
> Use `enable_thinking=False` or `/no_think`.
>
> - Best for: casual conversation, quick answers
> - Sampling: `temp=0.7`, `top_p=0.8`
>
> ---
>
> **Switch Dynamically**
> In multi-turn chats, the last `/think` or `/no_think` directive takes precedence.
>
> **Avoid Repetition**
> Set `presence_penalty=1.5` if the output gets stuck in loops.
>
> **Use Full Context**
> Allow up to 32,768 output tokens for complex tasks.
>
> **Agent Ready**
> Works with Qwen-Agent, MCP servers, and custom tools.
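If you run the model with `llama-cli` directly, the repetition and context tips above map onto command-line flags. A rough sketch (flag names are taken from recent llama.cpp builds; the GGUF file name is illustrative):

```bash
# Large context plus a presence penalty to break out of repetition loops
llama-cli -m Qwen3-8B-Q8_0.gguf \
  -c 32768 -n 4096 \
  --temp 0.6 --top-p 0.95 --top-k 20 --repeat-penalty 1.1 \
  --presence-penalty 1.5 \
  -p "Plan a week-long workshop on light-scattering experiments."
```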
|
|
|
|
|
## CLI Example Using Ollama or TGI Server
|
|
|
|
|
Here's how you can query this model via API using `curl` and `jq`; with Ollama, the sampling parameters go inside the `options` object. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
|
|
|
|
|
```bash
curl http://localhost:11434/api/generate -s -N -d '{
  "model": "hf.co/geoffmunn/Qwen3-8B:Q8_0",
  "prompt": "Repeat the following instruction exactly as given: Explain why the sky appears blue during the day but red at sunrise and sunset, using physics principles like Rayleigh scattering.",
  "stream": false,
  "options": {
    "temperature": 0.4,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "repeat_penalty": 1.1
  }
}' | jq -r '.response'
```
|
|
|
|
|
**Why this works well**:

- The prompt is meaningful and demonstrates either **reasoning**, **creativity**, or **clarity** depending on quant level.
- Temperature is tuned appropriately: lower for factual responses (`0.4`), higher for creative ones (`0.7`).
- Uses `jq` to extract clean output.
|
|
|
|
|
> Tip: For interactive streaming, set `"stream": true` and process the output line by line.
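A minimal streaming sketch: with `"stream": true`, Ollama emits one JSON object per line, and `jq -j` can join the `response` fragments as they arrive (the model tag matches the example above):

```bash
# Stream tokens as they are generated; each line is a JSON object with a "response" fragment
curl http://localhost:11434/api/generate -s -N -d '{
  "model": "hf.co/geoffmunn/Qwen3-8B:Q8_0",
  "prompt": "Explain Rayleigh scattering in one paragraph. /no_think",
  "stream": true,
  "options": { "temperature": 0.7, "top_p": 0.8 }
}' | jq -j '.response'
echo
```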
|
|
|
|
|
## Verification
|
|
|
|
|
Check integrity:
|
|
|
|
|
```bash
sha256sum -c ../SHA256SUMS.txt
```
|
|
|
|
|
## Usage
|
|
|
|
|
Compatible with:

- [LM Studio](https://lmstudio.ai) – local AI model runner with GPU acceleration
- [OpenWebUI](https://openwebui.com) – self-hosted AI platform with RAG and tools
- [GPT4All](https://gpt4all.io) – private, offline AI chatbot
- Directly via `llama.cpp` (see the sketch below)
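For direct `llama.cpp` use, a one-liner is enough to sanity-check the file. A minimal sketch (the GGUF file name is illustrative; the sampling values are the non-thinking defaults from above):

```bash
# Quick completion-style smoke test with llama.cpp
llama-cli -m Qwen3-8B-Q8_0.gguf \
  -n 256 --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.1 \
  -p "Briefly explain what quantization does to an LLM."
```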
|
|
|
|
|
Supports dynamic switching between thinking and non-thinking modes via `/think` and `/no_think` in multi-turn conversations.
|
|
|
|
|
## License
|
|
|
|
|
Apache 2.0 – see the base model for full terms.
|
|
|