---
license: apache-2.0
tags:
  - gguf
  - qwen
  - llama.cpp
  - quantized
  - text-generation
  - reasoning
  - agent
  - chat
  - multilingual
base_model: Qwen/Qwen3-8B
author: geoffmunn
---

# Qwen3-8B-Q3_K_S

Quantized version of [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) at **Q3_K_S** level, derived from **f16** base weights.

## Model Info

- **Format**: GGUF (for llama.cpp and compatible runtimes)
- **Size**: 3.77 GB
- **Precision**: Q3_K_S
- **Base Model**: [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)
- **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
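
To fetch just this quant from the Hub, one option is `huggingface-cli` (a sketch; the exact `.gguf` filename below is an assumption, so check the repository's file listing first):

```bash
# Sketch: download only this quant. The filename Qwen3-8B-Q3_K_S.gguf is assumed;
# confirm it against the repository's file list before running.
pip install -U "huggingface_hub[cli]"
huggingface-cli download geoffmunn/Qwen3-8B Qwen3-8B-Q3_K_S.gguf --local-dir .
```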

## Quality & Performance

| Metric | Value |
|-------|-------|
| **Quality** | Low |
| **Speed** | ⚡ Fast |
| **RAM Required** | ~3.4 GB |
| **Recommendation** | Minimally viable for simple tasks. Avoid for reasoning or multilingual use. |

## Prompt Template (ChatML)

This model follows Qwen's **ChatML** prompt format:

```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```

Set this in your app (LM Studio, OpenWebUI, etc.) for best results.
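
If your runtime takes a raw prompt string instead, you can apply the template yourself. A minimal sketch with llama.cpp's CLI (the GGUF filename is assumed, and the binary is named `main` in older builds):

```bash
# Sketch: pass a raw ChatML-formatted prompt to llama.cpp and stop at the end-of-turn token.
PROMPT='<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Explain what quantization does to a language model.<|im_end|>
<|im_start|>assistant
'

./llama-cli -m Qwen3-8B-Q3_K_S.gguf \
  -p "$PROMPT" \
  -n 256 \
  --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.1 \
  -r '<|im_end|>'
```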

## Generation Parameters

### Thinking Mode (Recommended for Logic)
Use when solving math, coding, or logical problems.

| Parameter | Value |
|---------|-------|
| Temperature | 0.6 |
| Top-P | 0.95 |
| Top-K | 20 |
| Min-P | 0.0 |
| Repeat Penalty | 1.1 |

> ❗ DO NOT use greedy decoding; it causes infinite loops.

Enable via:
- `enable_thinking=True` in tokenizer
- Or add `/think` in user input during conversation
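
As a sketch, thinking mode can be requested over Ollama's API by appending `/think` to the prompt, with the sampler values above passed under `options`:

```bash
# Sketch: thinking-mode sampling via Ollama; /think requests the reasoning trace.
curl http://localhost:11434/api/generate -s -d '{
  "model": "hf.co/geoffmunn/Qwen3-8B:Q3_K_S",
  "prompt": "A train travels 120 km in 1.5 hours. What is its average speed? /think",
  "stream": false,
  "options": {
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "repeat_penalty": 1.1
  }
}' | jq -r '.response'
```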

### Non-Thinking Mode (Fast Dialogue)
For casual chat and quick replies.

| Parameter | Value |
|---------|-------|
| Temperature | 0.7 |
| Top-P | 0.8 |
| Top-K | 20 |
| Min-P | 0.0 |
| Repeat Penalty | 1.1 |

Enable via:
- `enable_thinking=False`
- Or add `/no_think` in prompt

Stop sequences: `<|im_end|>`, `<|im_start|>`
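
A sketch of the same settings passed straight to llama.cpp's interactive chat mode (filename assumed; `-cnv` applies the chat template embedded in the GGUF, which also handles the stop tokens):

```bash
# Sketch: interactive chat with the non-thinking sampler settings.
# Type /no_think (or /think) in a turn to control the reasoning trace for that reply.
./llama-cli -m Qwen3-8B-Q3_K_S.gguf -cnv \
  --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --repeat-penalty 1.1
```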

## 💡 Usage Tips

> This model supports two operational modes:
>
> ### 🔍 Thinking Mode (Recommended for Logic)
> Activate with `enable_thinking=True` or append `/think` in prompt.
>
> - Ideal for: math, coding, planning, analysis
> - Use sampling: `temp=0.6`, `top_p=0.95`, `top_k=20`
> - Avoid greedy decoding
>
> ### ⚡ Non-Thinking Mode (Fast Chat)
> Use `enable_thinking=False` or `/no_think`.
>
> - Best for: casual conversation, quick answers
> - Sampling: `temp=0.7`, `top_p=0.8`
>
> ---
>
> 🔄 **Switch Dynamically**  
> In multi-turn chats, the last `/think` or `/no_think` directive takes precedence.
>
> 🔁 **Avoid Repetition**  
> Set `presence_penalty=1.5` if stuck in loops.
>
> 📏 **Use Full Context**  
> Allow up to 32,768 output tokens for complex tasks.
>
> 🧰 **Agent Ready**  
> Works with Qwen-Agent, MCP servers, and custom tools.
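
As a sketch of the dynamic switching described above, here is a multi-turn request against Ollama's chat endpoint; the final `/no_think` takes precedence, so the last reply comes back without a reasoning trace:

```bash
# Sketch: the last /think or /no_think directive in the conversation wins.
curl http://localhost:11434/api/chat -s -d '{
  "model": "hf.co/geoffmunn/Qwen3-8B:Q3_K_S",
  "stream": false,
  "messages": [
    {"role": "user", "content": "Why is the sky blue? /think"},
    {"role": "assistant", "content": "Rayleigh scattering: shorter blue wavelengths scatter more strongly in the atmosphere."},
    {"role": "user", "content": "And why are sunsets red? /no_think"}
  ]
}' | jq -r '.message.content'
```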

## 🖥️ CLI Example Using Ollama or TGI Server

Here’s how you can query this model via API using `curl` and `jq`. The example below targets Ollama's `/api/generate` endpoint; other servers such as Text Generation Inference expose a different endpoint and request schema, so adjust accordingly.

```bash
curl http://localhost:11434/api/generate -s -N -d '{
  "model": "hf.co/geoffmunn/Qwen3-8B:Q3_K_S",
  "prompt": "Summarize what a neural network is in one sentence.",
  "stream": false,
  "options": {
    "temperature": 0.5,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "repeat_penalty": 1.1
  }
}' | jq -r '.response'
```

🎯 **Why this works well**:
- The prompt is a concrete, self-contained task, so the reply directly shows the clarity this quant level can deliver.
- Sampler settings sit inside `options`, where Ollama expects them; `temperature: 0.5` suits a factual summary (drop toward `0.4` for stricter factual tasks, raise toward `0.7` for creative ones).
- `jq -r '.response'` extracts clean text from the JSON reply.

> 💬 Tip: For interactive streaming, set `"stream": true` and process the output line by line, as in the sketch below.
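
A streamed request returns one JSON object per line, so `jq` can unwrap the text chunks as they arrive (a sketch using the same endpoint and model tag as above):

```bash
# Sketch: stream tokens as they are generated; each line's "response" field
# holds the next chunk of text, and -j joins the chunks without newlines.
curl http://localhost:11434/api/generate -s -N -d '{
  "model": "hf.co/geoffmunn/Qwen3-8B:Q3_K_S",
  "prompt": "Write a haiku about autumn rain.",
  "stream": true
}' | jq -rj '.response'
echo
```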

## Verification

Check integrity:

```bash
sha256sum -c ../SHA256SUMS.txt
```

## Usage

Compatible with:
- [LM Studio](https://lmstudio.ai) – local AI model runner with GPU acceleration
- [OpenWebUI](https://openwebui.com) – self-hosted AI platform with RAG and tools
- [GPT4All](https://gpt4all.io) – private, offline AI chatbot
- Directly via `llama.cpp`

Supports dynamic switching between thinking modes via `/think` and `/no_think` in multi-turn conversations.
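
For direct llama.cpp use, a minimal sketch is to serve the GGUF with `llama-server` and query its OpenAI-compatible endpoint (the filename and port below are assumptions):

```bash
# Sketch: serve the quant locally, then query the OpenAI-compatible chat endpoint.
./llama-server -m Qwen3-8B-Q3_K_S.gguf --port 8080 &
# ...wait for the model to finish loading, then:

curl http://localhost:8080/v1/chat/completions -s -d '{
  "messages": [{"role": "user", "content": "Summarize what a GGUF file is. /no_think"}],
  "temperature": 0.7,
  "top_p": 0.8
}' | jq -r '.choices[0].message.content'
```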

## License

Apache 2.0 – see base model for full terms.