QiMing


An AI that rewrites its own rules for greater intelligence.

Result = Model Content × Math²


"Logic is the soul of a model, for it defines:

  • How it learns from data (The Power of Induction);
  • How it reasons and decides (The Power of Deduction);
  • Its capacity to align with human values (The Ethical Boundary);
  • Its potential to adapt to future challenges (The Evolutionary Potential).

If a model pursues nothing but sheer scale or computational power, ignoring the depth and breadth of its logic, it risks becoming a "paper tiger"—imposing on the surface, yet hollow at its core. Conversely, a model built upon elegant logic, even with fewer parameters, can unleash its true vitality in our complex world."


DISCLAIMER

The content generated by this model is for reference purposes only. Users are advised to verify its accuracy independently before use.

This is a 20-billion-parameter (20B) foundation model. Its outputs may be incomplete or inaccurate, and may include hallucinations.

If you find this AI too human-like, please remember: it is merely a more intelligent model — not an actual person.


Thanks to mradermacher for creating the GGUF versions of these models:

https://huggingface.co/mradermacher/QiMing-Moe-20B-MXFP4-GGUF

https://huggingface.co/mradermacher/QiMing-Moe-20B-MXFP4-i1-GGUF

Thanks to OpenAI for developing the gpt-oss foundational model on which QiMing-Moe-20B-MXFP4 is built:

https://huggingface.co/openai

Thanks to unsloth.ai (Unsloth) for their work enabling these models to run smoothly on standard hardware, such as a Google Colab T4 with 16GB VRAM:

https://unsloth.ai

Thanks to Google Colab for providing the T4 16GB environment.


Highlights

  • Permissive Apache 2.0 license: Build freely without copyleft restrictions or patent risk—ideal for experimentation, customization, and commercial deployment.
  • Configurable reasoning effort: Easily adjust the reasoning effort (low, medium, high) based on your specific use case and latency needs.
  • Full chain-of-thought: Gain complete access to the model’s reasoning process, facilitating easier debugging and increased trust in outputs. This reasoning is not intended to be shown to end users.
  • Fine-tunable: Fully customize models to your specific use case through parameter fine-tuning.
  • Agentic capabilities: Use the models’ native capabilities for function calling, web browsing, Python code execution, and Structured Outputs.
  • MXFP4 quantization: The models were post-trained with MXFP4 quantization of the MoE weights, allowing the QiMing-Moe-20B-MXFP4 model to run within 16GB of memory. All evals were performed with the same MXFP4 quantization.

Inference examples

Transformers

You can use QiMing-Moe-20B-MXFP4 with Transformers. If you use the Transformers chat template, it will automatically apply the harmony response format. If you call model.generate directly, you need to apply the harmony format manually, either via the chat template or with the openai-harmony package.

To get started, install the necessary dependencies to set up your environment:

pip install -U transformers kernels torch 

Once set up, you can run the model with the snippet below:

from transformers import pipeline

model_id = "aifeifei798/QiMing-Moe-20B-MXFP4"

# Build a text-generation pipeline; device_map="auto" spreads the model
# across available devices, and torch_dtype="auto" uses the checkpoint dtype.
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

# The chat template (and thus the harmony format) is applied automatically.
outputs = pipe(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])
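
If you drive model.generate directly instead of using the pipeline, the prompt must be rendered into the harmony format yourself. A minimal sketch using the openai-harmony package (the message content is illustrative; check the package documentation for the full API):

from openai_harmony import (
    HarmonyEncodingName,
    load_harmony_encoding,
    Conversation,
    Message,
    Role,
)

# Load the harmony encoding used by gpt-oss-family models.
encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)

# Build a conversation and render it into prompt tokens for completion.
convo = Conversation.from_messages([
    Message.from_role_and_content(Role.USER, "Explain quantum mechanics clearly and concisely."),
])
prompt_tokens = encoding.render_conversation_for_completion(convo, Role.ASSISTANT)
# prompt_tokens can now be passed to model.generate as input IDs.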

Alternatively, you can run the model via Transformers Serve to spin up an OpenAI-compatible web server:

transformers serve
transformers chat localhost:8000 --model-name-or-path aifeifei798/QiMing-Moe-20B-MXFP4

Learn more about how to use gpt-oss with Transformers.

vLLM

vLLM recommends using uv for Python dependency management. You can use vLLM to spin up an OpenAI-compatible webserver. The following command will automatically download the model and start the server.

uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match

vllm serve aifeifei798/QiMing-Moe-20B-MXFP4
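
Once the server is up, you can query it with any OpenAI-compatible client. A minimal sketch using the official openai Python package (the api_key value is a placeholder; vLLM does not validate it by default):

from openai import OpenAI

# Point the client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="aifeifei798/QiMing-Moe-20B-MXFP4",
    messages=[
        {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
    ],
)
print(response.choices[0].message.content)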

Learn more about how to use gpt-oss with vLLM.

PyTorch / Triton

To learn about how to use this model with PyTorch and Triton, check out our reference implementations in the gpt-oss repository.

LM Studio

If you are using LM Studio, you can use the following command to download the model.

# QiMing-Moe-20B-MXFP4
lms get aifeifei798/QiMing-Moe-20B-MXFP4

Check out our awesome list for a broader collection of gpt-oss resources and inference partners.


Download the model

You can download the model with the Hugging Face CLI:

# QiMing-Moe-20B-MXFP4
huggingface-cli download aifeifei798/QiMing-Moe-20B-MXFP4 --local-dir QiMing-Moe-20B-MXFP4/
pip install gpt-oss
python -m gpt_oss.chat QiMing-Moe-20B-MXFP4/

Reasoning levels

You can adjust the reasoning level to suit your task, choosing from three settings:

  • Low: Fast responses for general dialogue.
  • Medium: Balanced speed and detail.
  • High: Deep and detailed analysis.

The reasoning level can be set in the system prompt, e.g., "Reasoning: high", as in the sketch below.
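
For example, reusing the Transformers pipeline from above (the prompt text is illustrative):

from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="aifeifei798/QiMing-Moe-20B-MXFP4",
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    # The reasoning level is set via the system prompt.
    {"role": "system", "content": "Reasoning: high"},
    {"role": "user", "content": "Prove that the square root of 2 is irrational."},
]

outputs = pipe(messages, max_new_tokens=512)
print(outputs[0]["generated_text"][-1])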

Tool use

The gpt-oss models are excellent for:

  • Web browsing (using built-in browsing tools)
  • Function calling with defined schemas (see the sketch after this list)
  • Agentic operations like browser tasks
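
As a sketch of function calling with Transformers: you can pass Python functions to the chat template, which embeds their JSON schemas in the rendered prompt. The get_weather function below is a hypothetical stub, not part of the model or any library:

from transformers import AutoTokenizer

def get_weather(city: str) -> str:
    """Get the current weather for a city.

    Args:
        city: Name of the city.
    """
    return f"Sunny in {city}"  # hypothetical stub for illustration

tokenizer = AutoTokenizer.from_pretrained("aifeifei798/QiMing-Moe-20B-MXFP4")

# Render a prompt that includes the tool schema; the model can then emit
# a structured call to get_weather in its response.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=[get_weather],
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)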

Fine-tuning

QiMing-Moe-20B-MXFP4 can be fine-tuned for a variety of specialized use cases.

As a smaller model, it can be fine-tuned on consumer hardware.
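
As a rough illustration, here is a minimal LoRA sketch with the peft library, assuming the checkpoint loads as a standard Transformers model; the target_modules names are assumptions and should be checked against the actual module names. Tools such as Unsloth, credited above, wrap this workflow for low-VRAM hardware:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "aifeifei798/QiMing-Moe-20B-MXFP4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Attach low-rank adapters to the attention projections only; the base
# weights stay frozen, which keeps memory requirements modest.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapter weights are trainable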
