QiMing
An AI that rewrites its own rules for greater intelligence.
Result = Model Content × Math²
"Logic is the soul of a model, for it defines:
- How it learns from data (The Power of Induction);
- How it reasons and decides (The Power of Deduction);
- Its capacity to align with human values (The Ethical Boundary);
- Its potential to adapt to future challenges (The Evolutionary Potential).
If a model pursues nothing but sheer scale or computational power, ignoring the depth and breadth of its logic, it risks becoming a "paper tiger"—imposing on the surface, yet hollow at its core. Conversely, a model built upon elegant logic, even with fewer parameters, can unleash its true vitality in our complex world."
DISCLAIMER
The content generated by this model is for reference purposes only. Users are advised to verify its accuracy independently before use.
This is a 20-billion-parameter (20B) foundation model. It may produce incomplete or inaccurate information, including hallucinations.
If you find this AI too human-like, please remember: it is merely a more intelligent model — not an actual person.
Thanks to mradermacher for creating the GGUF versions of these models:
- https://huggingface.co/mradermacher/QiMing-Moe-20B-MXFP4-GGUF
- https://huggingface.co/mradermacher/QiMing-Moe-20B-MXFP4-i1-GGUF

Thanks to the developers of the foundational model (aifeifei798/QiMing-Moe-20B-MXFP4) used in this project.

Thanks to unsloth.ai (Unsloth) for their work enabling these models to run smoothly on standard hardware, such as a Google Colab T4 with 16GB VRAM.

Thanks to Google Colab for providing the T4 16GB environment.
Highlights
- Permissive Apache 2.0 license: Build freely without copyleft restrictions or patent risk—ideal for experimentation, customization, and commercial deployment.
- Configurable reasoning effort: Easily adjust the reasoning effort (low, medium, high) based on your specific use case and latency needs.
- Full chain-of-thought: Gain complete access to the model's reasoning process, facilitating easier debugging and increased trust in outputs. This reasoning trace is not intended to be shown to end users.
- Fine-tunable: Fully customize models to your specific use case through parameter fine-tuning.
- Agentic capabilities: Use the models’ native capabilities for function calling, web browsing, Python code execution, and Structured Outputs.
- MXFP4 quantization: The models were post-trained with MXFP4 quantization of the MoE weights, making QiMing-Moe-20B-MXFP4 run within 16GB of memory. All evals were performed with the same MXFP4 quantization.
Inference examples
Transformers
You can use QiMing-Moe-20B-MXFP4 with Transformers. If you use the Transformers chat template, it will automatically apply the harmony response format. If you use model.generate directly, you need to apply the harmony format manually using the chat template or the openai-harmony package.
To get started, install the necessary dependencies to set up your environment:
pip install -U transformers kernels torch
Once set up, you can run the model with the snippet below:
from transformers import pipeline
import torch

model_id = "aifeifei798/QiMing-Moe-20B-MXFP4"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = pipe(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])
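If you call model.generate directly instead of using the pipeline, you need to apply the harmony format yourself, as noted above. A minimal sketch using the tokenizer's chat template (which emits the harmony format for you):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aifeifei798/QiMing-Moe-20B-MXFP4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

# The chat template applies the harmony response format to the prompt.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:]))
```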
Alternatively, you can run the model via Transformers Serve to spin up an OpenAI-compatible web server:
transformers serve
transformers chat localhost:8000 --model-name-or-path aifeifei798/QiMing-Moe-20B-MXFP4
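Once the server is running, any OpenAI-compatible client can talk to it. A minimal sketch using the openai Python package (the port and the placeholder API key are assumptions about the default transformers serve configuration):

```python
from openai import OpenAI

# Assumption: the server listens on localhost:8000 and needs no real key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="aifeifei798/QiMing-Moe-20B-MXFP4",
    messages=[
        {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
    ],
)
print(response.choices[0].message.content)
```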
Learn more about how to use gpt-oss with Transformers.
vLLM
vLLM recommends using uv for Python dependency management. You can use vLLM to spin up an OpenAI-compatible webserver. The following command will automatically download the model and start the server.
uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match
vllm serve aifeifei798/QiMing-Moe-20B-MXFP4
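Once the server is up, the same OpenAI-compatible client sketch shown in the Transformers section works here as well; just point base_url at the vLLM server (http://localhost:8000/v1 by default).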
Learn more about how to use gpt-oss with vLLM.
PyTorch / Triton
To learn about how to use this model with PyTorch and Triton, check out our reference implementations in the gpt-oss repository.
LM Studio
If you are using LM Studio, you can use the following command to download the model:
# QiMing-Moe-20B-MXFP4
lms get aifeifei798/QiMing-Moe-20B-MXFP4
Check out the gpt-oss awesome list for a broader collection of gpt-oss resources and inference partners.
Download the model
You can download the model using the Hugging Face CLI:
# QiMing-Moe-20B-MXFP4
huggingface-cli download aifeifei798/QiMing-Moe-20B-MXFP4 --local-dir QiMing-Moe-20B-MXFP4/
pip install gpt-oss
python -m gpt_oss.chat QiMing-Moe-20B-MXFP4/
Reasoning levels
You can adjust the reasoning level to suit your task across three levels:
- Low: Fast responses for general dialogue.
- Medium: Balanced speed and detail.
- High: Deep and detailed analysis.
The reasoning level can be set in the system prompt, e.g., "Reasoning: high".
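For example, with the Transformers chat template you can pass the level through a system message. A minimal sketch, reusing the pipeline from the inference example above:

```python
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="aifeifei798/QiMing-Moe-20B-MXFP4",
    torch_dtype="auto",
    device_map="auto",
)

# Set the reasoning level via the system prompt, per the convention above.
messages = [
    {"role": "system", "content": "Reasoning: high"},
    {"role": "user", "content": "Prove that the square root of 2 is irrational."},
]
outputs = pipe(messages, max_new_tokens=512)
print(outputs[0]["generated_text"][-1])
```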
Tool use
The gpt-oss models are excellent for:
- Web browsing (using built-in browsing tools)
- Function calling with defined schemas (see the sketch after this list)
- Agentic operations like browser tasks
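A minimal function-calling sketch using the Transformers chat template's standard tools argument; get_weather is a hypothetical function defined here only for illustration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aifeifei798/QiMing-Moe-20B-MXFP4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

# Hypothetical tool schema, in the standard JSON-schema format that
# tokenizer.apply_chat_template accepts via its `tools` argument.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

messages = [{"role": "user", "content": "What's the weather in Tokyo right now?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# The model should emit a get_weather tool call rather than a final answer.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:]))
```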
Fine-tuning
QiMing-Moe-20B-MXFP4 can be fine-tuned for a variety of specialized use cases. As a smaller model, it can be fine-tuned on consumer hardware.
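A minimal parameter-efficient sketch using the PEFT library; the LoRA rank, alpha, and target module names are illustrative assumptions, not tuned values:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "aifeifei798/QiMing-Moe-20B-MXFP4",
    torch_dtype="auto",
    device_map="auto",
)

# Illustrative LoRA settings; adjust the rank and target modules
# for your task and hardware budget.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # assumption: attention projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Train the adapter with your preferred trainer (e.g., TRL's SFTTrainer).
```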