---
license: apache-2.0
tags:
- gguf
- qwen
- qwen3
- qwen3-8b
- qwen3-8b-gguf
- llama.cpp
- quantized
- text-generation
- reasoning
- agent
- chat
- multilingual
base_model: Qwen/Qwen3-8B
author: geoffmunn
pipeline_tag: text-generation
language:
- en
- zh
- es
- fr
- de
- ru
- ar
- ja
- ko
- hi
---

# Qwen3-8B-GGUF

This is a **GGUF-quantized version** of the **[Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)** language model, an **8-billion-parameter** LLM from Alibaba's Qwen series designed for **advanced reasoning, agentic behavior, and multilingual tasks**.

Converted for use with `llama.cpp` and compatible tools such as OpenWebUI, LM Studio, and GPT4All.

> 💡 **Key Features of Qwen3-8B**:
> - 🤔 **Thinking Mode**: Use `enable_thinking=True` or `/think` for step-by-step logic, math, and code.
> - ⚡ **Non-Thinking Mode**: Use `/no_think` for fast, lightweight dialogue.
> - 🧰 **Agent Capable**: Integrates with tools via MCP, APIs, and plugins.
> - 🌍 **Multilingual Support**: Fluent in 100+ languages including Chinese, English, Spanish, Arabic, and Japanese.
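
Both modes can be toggled per request by appending the soft switch to the end of a prompt. Here is a minimal sketch using the `llama.cpp` CLI; the filename, sampling values, and prompts are illustrative, so substitute the quant you actually downloaded:

```bash
# Thinking mode: the model reasons step by step inside <think>...</think> tags
./llama-cli -m Qwen3-8B-f16:Q3_K_M.gguf --temp 0.6 --top-p 0.95 --top-k 20 \
  -p 'A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost? /think'

# Non-thinking mode: skips the reasoning trace for a fast, direct reply
./llama-cli -m Qwen3-8B-f16:Q3_K_M.gguf --temp 0.7 --top-p 0.8 \
  -p 'Say hello in five languages. /no_think'
```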

## Available Quantizations (from f16)

These variants were built from an **f16** base model to ensure consistency across quantization levels.

| Level     | Speed     | Size    | Recommendation |
|-----------|-----------|---------|----------------|
| Q2_K      | ⚡ Fastest | 3.28 GB | Not recommended. Came first in the bat-and-ball question but made no other appearances. |
| 🥉 Q3_K_S | ⚡ Fast    | 3.77 GB | 🥉 Took first and second places in questions at both ends of the temperature spectrum. |
| 🥇 Q3_K_M | ⚡ Fast    | 4.12 GB | 🥇 **Best overall model.** A top-3 finisher in every question except the haiku. |
| 🥉 Q4_K_S | 🚀 Fast    | 4.8 GB  | 🥉 Took first and second places in questions at both ends of the temperature spectrum. |
| Q4_K_M    | 🚀 Fast    | 5.85 GB | Took first and second places in high-temperature questions. |
| 🥈 Q5_K_S | 🐢 Medium  | 5.72 GB | 🥈 A good second place. Good for all query types. |
| Q5_K_M    | 🐢 Medium  | 5.85 GB | Not recommended; no top-3 appearances for any question. |
| Q6_K      | 🐌 Slow    | 6.73 GB | Showed up in a few results, but not recommended. |
| Q8_0      | 🐌 Slow    | 8.71 GB | Not recommended; only one top-3 finish. |

## Model analysis and rankings

There were numerous good candidates: many different models showed up in the top 3 across the questions. However, **Qwen3-8B-f16:Q3_K_M** was a finalist in all but one question, so it is the recommended model. **Qwen3-8B-f16:Q5_K_S** did nearly as well and is worth considering.

The 'hello' question was the first time that all models got it exactly right. All models in the 8B range did well, so it is mainly a question of which one works best on your hardware.

You can read the results here: [Qwen3-8b-analysis.md](Qwen3-8b-analysis.md)

If you find this useful, please give the project a ❤️ like.

## Usage

Load this model using:

- [OpenWebUI](https://openwebui.com) – self-hosted AI interface with RAG & tools
- [LM Studio](https://lmstudio.ai) – desktop app with GPU support and chat templates
- [GPT4All](https://gpt4all.io) – private, local AI chatbot (offline-first)
- Or directly via `llama.cpp` (see the server sketch below)
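
For `llama.cpp`, the bundled `llama-server` exposes an OpenAI-compatible endpoint that OpenWebUI and most other frontends can talk to. A minimal sketch, assuming a local `llama.cpp` build and the Q4_K_S file (swap in whichever quant you downloaded):

```bash
# Serve the model on an OpenAI-compatible endpoint at http://localhost:8080
./llama-server -m Qwen3-8B-f16:Q4_K_S.gguf -c 4096 --port 8080

# Quick smoke test from another terminal
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello! /no_think"}]}'
```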

Each quantized model includes its own `README.md` and shares a common `MODELFILE` for optimal configuration.

Importing directly into Ollama should work, but you might encounter this error: `Error: invalid character '<' looking for beginning of value`.
If this happens, try these steps:

1. `wget https://huggingface.co/geoffmunn/Qwen3-8B/resolve/main/Qwen3-8B-f16%3AQ3_K_M.gguf` (replace the quantised version with the one you want)
2. `nano Modelfile` and enter these details (again, replacing Q3_K_M with the version you want):

```text
FROM ./Qwen3-8B-f16:Q3_K_M.gguf

# Chat template using ChatML (used by Qwen)
SYSTEM You are a helpful assistant.

TEMPLATE "{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>

# Default sampling
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER top_k 20
PARAMETER min_p 0.0
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 4096
```

The `num_ctx` value has been lowered to 4096 to significantly increase speed; raise it if you need a longer context window (see the run example after step 3).

3. Then run this command: `ollama create Qwen3-8B-f16:Q3_K_M -f Modelfile`

You will now see "Qwen3-8B-f16:Q3_K_M" in your Ollama model list.
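
You can then chat with the model immediately. A brief sketch (the prompt is illustrative, and the `/set parameter` override assumes Ollama's interactive session):

```bash
# One-shot prompt; append /no_think for a fast, direct answer
ollama run Qwen3-8B-f16:Q3_K_M 'Give me three facts about the Moon. /no_think'

# Or start an interactive session and raise the context window again:
#   ollama run Qwen3-8B-f16:Q3_K_M
#   >>> /set parameter num_ctx 8192
```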

These import steps are also useful if you want to customise the default parameters or system prompt.

## Author

👤 Geoff Munn (@geoffmunn)

🌐 [Hugging Face Profile](https://huggingface.co/geoffmunn)

## Disclaimer

This is a community conversion for local inference. Not affiliated with Alibaba Cloud or the Qwen team.