---
license: apache-2.0
tags:
- gguf
- qwen
- llama.cpp
- quantized
- text-generation
- reasoning
- agent
- chat
- multilingual
base_model: Qwen/Qwen3-8B
author: geoffmunn
---
# Qwen3-8B-Q3_K_S
Quantized version of [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) at **Q3_K_S** level, derived from **f16** base weights.
## Model Info
- **Format**: GGUF (for llama.cpp and compatible runtimes)
- **Size**: 3.77 GB
- **Precision**: Q3_K_S
- **Base Model**: [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)
- **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
## Quality & Performance
| Metric | Value |
|-------|-------|
| **Quality** | Low |
| **Speed** | ⚡ Fast |
| **RAM Required** | ~3.4 GB |
| **Recommendation** | Minimal viable for simple tasks. Avoid for reasoning or multilingual use. |
## Prompt Template (ChatML)
This model uses Qwen's **ChatML** prompt format:
```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```
Set this in your app (LM Studio, OpenWebUI, etc.) for best results.
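If you are assembling prompts by hand rather than letting your runtime apply the template, the format above can be reproduced with a small helper. This is an illustrative sketch; `build_chatml_prompt` is not part of any official API:

```python
def build_chatml_prompt(user_msg: str,
                        system_msg: str = "You are a helpful assistant.") -> str:
    """Assemble a ChatML prompt string matching the template above.

    The trailing assistant header is left open so the model
    generates the assistant turn.
    """
    return (
        f"<|im_start|>system\n{system_msg}<|im_end|>\n"
        f"<|im_start|>user\n{user_msg}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

print(build_chatml_prompt("What is GGUF?"))
```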
## Generation Parameters
### Thinking Mode (Recommended for Logic)
Use when solving math, coding, or logical problems.
| Parameter | Value |
|---------|-------|
| Temperature | 0.6 |
| Top-P | 0.95 |
| Top-K | 20 |
| Min-P | 0.0 |
| Repeat Penalty | 1.1 |
> ❌ Do not use greedy decoding; it can cause endless repetition.
Enable via:
- `enable_thinking=True` in tokenizer
- Or add `/think` in user input during conversation
### Non-Thinking Mode (Fast Dialogue)
For casual chat and quick replies.
| Parameter | Value |
|---------|-------|
| Temperature | 0.7 |
| Top-P | 0.8 |
| Top-K | 20 |
| Min-P | 0.0 |
| Repeat Penalty | 1.1 |
Enable via:
- `enable_thinking=False`
- Or add `/no_think` in prompt
Stop sequences: `<|im_end|>`, `<|im_start|>`
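For reference, the two parameter tables and stop sequences above can be collected into presets and selected from the prompt. `sampler_for` and the preset names are illustrative helpers, not part of llama.cpp or Ollama:

```python
# Sampler presets mirroring the tables above (illustrative names).
SAMPLING_PRESETS = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20,
                 "min_p": 0.0, "repeat_penalty": 1.1},
    "non_thinking": {"temperature": 0.7, "top_p": 0.8, "top_k": 20,
                     "min_p": 0.0, "repeat_penalty": 1.1},
}

STOP_SEQUENCES = ["<|im_end|>", "<|im_start|>"]

def sampler_for(prompt: str) -> dict:
    """Pick a preset: a /think or /no_think tag in the prompt overrides
    the default (thinking mode, the recommended setting for logic)."""
    if "/no_think" in prompt:
        return SAMPLING_PRESETS["non_thinking"]
    if "/think" in prompt:
        return SAMPLING_PRESETS["thinking"]
    return SAMPLING_PRESETS["thinking"]
```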
## 💡 Usage Tips
> This model supports two operational modes:
>
> ### Thinking Mode (Recommended for Logic)
> Activate with `enable_thinking=True` or append `/think` in prompt.
>
> - Ideal for: math, coding, planning, analysis
> - Use sampling: `temp=0.6`, `top_p=0.95`, `top_k=20`
> - Avoid greedy decoding
>
> ### ⚡ Non-Thinking Mode (Fast Chat)
> Use `enable_thinking=False` or `/no_think`.
>
> - Best for: casual conversation, quick answers
> - Sampling: `temp=0.7`, `top_p=0.8`
>
> ---
>
> **Switch Dynamically**
> In multi-turn chats, the last `/think` or `/no_think` directive takes precedence.
>
> **Avoid Repetition**
> Set `presence_penalty=1.5` if the output gets stuck in loops.
>
> **Use Full Context**
> Allow up to 32,768 output tokens for complex tasks.
>
> 🧰 **Agent Ready**
> Works with Qwen-Agent, MCP servers, and custom tools.
## 🖥️ CLI Example Using Ollama or TGI Server
Here's how to query this model over HTTP using `curl` and `jq`. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
```bash
curl http://localhost:11434/api/generate -s -N -d '{
"model": "hf.co/geoffmunn/Qwen3-8B:Q3_K_S",
"prompt": "Repeat the following instruction exactly as given: Summarize what a neural network is in one sentence.",
"temperature": 0.5,
"top_p": 0.95,
"top_k": 20,
"min_p": 0.0,
"repeat_penalty": 1.1,
"stream": false
}' | jq -r '.response'
```
🎯 **Why this works well**:
- The instruction-following prompt makes quality loss from quantization easy to spot.
- Temperature is tuned for a factual answer (`0.5` here); raise it toward `0.7` for creative prompts.
- `jq -r '.response'` extracts clean text from the JSON reply.
> 💬 Tip: For interactive streaming, set `"stream": true` and process the response line by line.
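With streaming enabled, the server emits one JSON object per line. A minimal Python sketch for reassembling the text, assuming Ollama's `/api/generate` NDJSON shape (incremental `response` fragments and a final `done: true`):

```python
import json

def collect_stream(lines) -> str:
    """Concatenate `response` fragments from an Ollama-style NDJSON
    stream, stopping at the chunk marked done=true."""
    out = []
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        out.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(out)

# Simulated stream fragments (shape follows Ollama's streaming output):
sample = [
    '{"response": "A neural network ", "done": false}',
    '{"response": "maps inputs to outputs.", "done": true}',
]
print(collect_stream(sample))  # A neural network maps inputs to outputs.
```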
## Verification
Check integrity:
```bash
sha256sum -c ../SHA256SUMS.txt
```
## Usage
Compatible with:
- [LM Studio](https://lmstudio.ai) β local AI model runner with GPU acceleration
- [OpenWebUI](https://openwebui.com) β self-hosted AI platform with RAG and tools
- [GPT4All](https://gpt4all.io) β private, offline AI chatbot
- Directly via `llama.cpp`
Supports dynamic switching between thinking modes via `/think` and `/no_think` in multi-turn conversations.
## License
Apache 2.0 β see base model for full terms.