MiniCPM5 Twitch Chat Style Bot
A compact Twitch-chat-style bot built on openbmb/MiniCPM5-1B.
Short replies, clipped reactions, emotes, slang, and usable one-line answers.
GGUF files · Live demo · Ollama · LM Studio · Chat UI · Python / PEFT · Training details
It is tuned to answer like Twitch chat without turning into long assistant paragraphs.
What To Use
| File | Best for | Notes |
|---|---|---|
| gguf/minicpm5-twitch-chat-style.Q4_K_M.gguf | Ollama, LM Studio, local desktop apps | Recommended for most users |
| gguf/minicpm5-twitch-chat-style.F16.gguf | Higher-fidelity local GGUF use | Larger file |
| adapter_model.safetensors | Python / PEFT / Transformers | LoRA adapter only; requires openbmb/MiniCPM5-1B |
Behavior
This is a style adapter, not a general knowledge fine-tune. It is meant for chat/reply generation where the model should stay concise while still answering simple prompts.
Expected behavior:
- short replies, usually 1-12 words;
- emote-heavy Twitch cadence;
- short sentence replies when the user asks something concrete;
- no private reasoning or chain-of-thought traces.
Recommended system prompt:
You are a silly Twitch-chat-style bot. Reply in one short message, usually 1-12 words. Copy the channel style: clipped reactions, emotes, chants, and chat slang. Be usable, but do not write paragraphs. Do not reveal private reasoning or chain-of-thought traces.
Quick Start: Ollama
Download gguf/minicpm5-twitch-chat-style.Q4_K_M.gguf, then create a Modelfile next to it:
FROM ./minicpm5-twitch-chat-style.Q4_K_M.gguf
SYSTEM "You are a silly Twitch-chat-style bot. Reply in one short message, usually 1-12 words. Copy the channel style: clipped reactions, emotes, chants, and chat slang. Be usable, but do not write paragraphs. Do not reveal private reasoning or chain-of-thought traces."
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.05
Create and run the model:
ollama create minicpm5-twitch-chat -f Modelfile
ollama run minicpm5-twitch-chat
Quick Start: LM Studio
Use the GGUF version, preferably gguf/minicpm5-twitch-chat-style.Q4_K_M.gguf.
In LM Studio:
- open the model search/import flow;
- download or import the GGUF file;
- load it in the chat panel;
- use the system prompt above;
- start with temperature 0.7, top-p 0.9, and max new tokens around 32-48.
Or, more simply, use whichever local AI app you prefer: anything that can load a Llama-compatible GGUF should work as a starting point.
Simple Chat UI
This repo includes a tiny local web UI in webui/. It supports Ollama, LM Studio, and llama.cpp server endpoints, plus automatic emote rendering for local emote names, :colon: variants, partial colon quirks, and common global 7TV/BTTV/FFZ emotes.
Start it from the repo root:
python webui/server.py
Then open:
http://127.0.0.1:7860
Default endpoints:
| Provider | Endpoint |
|---|---|
| Ollama | http://localhost:11434/api/chat |
| LM Studio | http://localhost:1234/v1/chat/completions |
| llama.cpp server | http://localhost:8080/v1/chat/completions |
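The endpoints above fall into two request styles: Ollama's native /api/chat and the OpenAI-compatible /v1/chat/completions used by LM Studio and the llama.cpp server. The sketch below, using only the standard library, shows one way a client could build a request for either style and pull out the reply text. The helper names (build_payload, extract_reply, chat) are illustrative assumptions and are not taken from webui/server.py; the sampling values mirror the recommendations in this README.

```python
import json
import urllib.request

SYSTEM_PROMPT = (
    "You are a silly Twitch-chat-style bot. Reply in one short message, "
    "usually 1-12 words. Copy the channel style: clipped reactions, emotes, "
    "chants, and chat slang. Be usable, but do not write paragraphs. "
    "Do not reveal private reasoning or chain-of-thought traces."
)


def build_payload(provider, model, user_text, system=SYSTEM_PROMPT):
    """Build a request body for 'ollama' or any OpenAI-compatible provider."""
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": user_text},
    ]
    if provider == "ollama":
        # Ollama takes sampling settings under "options".
        return {
            "model": model,
            "messages": messages,
            "stream": False,
            "options": {
                "temperature": 0.7,
                "top_p": 0.9,
                "repeat_penalty": 1.05,
                "num_predict": 48,
            },
        }
    # LM Studio and llama.cpp server accept OpenAI-style top-level fields.
    return {
        "model": model,
        "messages": messages,
        "temperature": 0.7,
        "top_p": 0.9,
        "max_tokens": 48,
    }


def extract_reply(provider, response):
    """Return the assistant text from a decoded JSON response."""
    if provider == "ollama":
        return response["message"]["content"].strip()
    return response["choices"][0]["message"]["content"].strip()


def chat(provider, url, model, user_text):
    """POST one chat turn to a running local server and return the reply."""
    body = json.dumps(build_payload(provider, model, user_text)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=60) as resp:
        return extract_reply(provider, json.loads(resp.read().decode("utf-8")))
```

With Ollama running and the model created as in the Ollama quick start, a call such as chat("ollama", "http://localhost:11434/api/chat", "minicpm5-twitch-chat", "should i trust this") should return a single short reply.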
Python / PEFT Setup
Use this if you want the LoRA adapter directly instead of the merged GGUF.
pip install -U transformers peft accelerate bitsandbytes torch
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
base_model = "openbmb/MiniCPM5-1B"
adapter = "cb1c7/minicpm5-psp1g-chat-style-lora"
quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=quant,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, adapter)
model.eval()
Generation should use MiniCPM's chat template with thinking disabled:
messages = [
    {
        "role": "system",
        "content": "You are a silly Twitch-chat-style bot. Reply in one short message, usually 1-12 words. Copy the channel style: clipped reactions, emotes, chants, and chat slang. Be usable, but do not write paragraphs. Do not reveal private reasoning or chain-of-thought traces.",
    },
    {"role": "user", "content": "should i trust this"},
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
inputs.pop("token_type_ids", None)
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=48,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.05,
        pad_token_id=tokenizer.eos_token_id,
    )
reply = tokenizer.decode(
    output[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
).strip()
print(reply)
Data Format
Training used system/user/assistant chat rows. One representative row:
{
  "messages": [
    {
      "role": "system",
      "content": "You are a silly Twitch-chat-style bot. Reply in one short message, usually 1-12 words. Copy the channel style: clipped reactions, emotes, chants, and chat slang. Be usable, but do not write paragraphs. Do not reveal private reasoning or chain-of-thought traces."
    },
    {
      "role": "user",
      "content": "can you summarize the vibe"
    },
    {
      "role": "assistant",
      "content": "the vibe is deeply cooked"
    }
  ]
}
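For anyone preparing rows in the same shape, here is a minimal sketch that builds one row and serializes rows one JSON object per line. The helper names (make_row, to_jsonl) are hypothetical, and JSONL storage is an assumption: this README does not say how the training rows were stored on disk.

```python
import json

SYSTEM_PROMPT = (
    "You are a silly Twitch-chat-style bot. Reply in one short message, "
    "usually 1-12 words. Copy the channel style: clipped reactions, emotes, "
    "chants, and chat slang. Be usable, but do not write paragraphs. "
    "Do not reveal private reasoning or chain-of-thought traces."
)


def make_row(user_text, assistant_text, system=SYSTEM_PROMPT):
    """Return one system/user/assistant chat row."""
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": assistant_text},
        ]
    }


def to_jsonl(rows):
    """Serialize rows as one JSON object per line, keeping emoji readable."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


row = make_row("can you summarize the vibe", "the vibe is deeply cooked")
print(to_jsonl([row]))
```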
Training Data
The source data was cleaned Twitch chat from three streams, mixed with a small synthetic behavior bridge so the model can answer simple prompts without losing the chat style.
| Item | Value |
|---|---|
| Final training rows | 143,485 |
| Eval rows | 512 |
| Format | system/user/assistant chat |
| Loss | response-only SFT |
| Chat template | MiniCPM, enable_thinking=False |
| Behavior bridge | 480 curated rows repeated 30x |
Cleanup removed invisible characters, pure-number spam, URL-like messages, commands, bot leaderboard/status messages, and mechanical subscription/event messages. Silly event tails with chat-style content were preserved.
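A rough sketch of what such a cleanup filter could look like is shown below. It is not the actual pipeline: the regular expressions, the list of invisible code points, the "!" command prefix, and the function name clean_message are all assumptions for illustration, and the bot leaderboard/status and subscription/event filters are left out because their rules are not described here.

```python
import re

# Zero-width and other invisible code points (an assumed, non-exhaustive list).
INVISIBLE = re.compile("[\u200b-\u200f\u2060\ufeff\U000e0000]")
# Messages made only of digits, whitespace, dots, and commas.
PURE_NUMBER = re.compile(r"[\d\s.,]+")
# Anything that looks like a link (assumed heuristic, a few common TLDs only).
URL_LIKE = re.compile(
    r"https?://|www\.|\b[\w-]+\.(?:com|net|org|tv|gg)\b", re.IGNORECASE
)


def clean_message(text):
    """Return the cleaned message, or None if the message should be dropped."""
    text = INVISIBLE.sub("", text).strip()
    if not text:
        return None
    if text.startswith("!"):  # chat commands such as !followage
        return None
    if PURE_NUMBER.fullmatch(text):  # pure-number spam
        return None
    if URL_LIKE.search(text):  # URL-like messages
        return None
    return text


samples = ["KEKW\u200b", "12345", "!followage", "the vibe is deeply cooked"]
print([clean_message(s) for s in samples])
```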
Training Run
Run B was selected over Run A because Run A copied the style well but was too terse. Run B kept the short Twitch cadence while producing more relevant one-line sentence replies.
| Setting | Value |
|---|---|
| Method | Unsloth QLoRA / PEFT LoRA |
| Base model | openbmb/MiniCPM5-1B |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| Learning rate | 1e-4 |
| Epochs | 1 |
| Max sequence length | 512 |
| Effective batch size | 32 |
| Train loss | 1.7102 |
| GPU | RTX 3090 24GB |
| Run B wall time | 3:52:19 |
| Run B train runtime | 13,759s |
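A few derived figures follow from the numbers in this README. The short calculation below is a sanity check, not logged output: it assumes one optimizer step per effective batch with the last partial batch kept, uses the 143,485 final training rows reported under Training Data, and assumes the standard LoRA scaling factor alpha / rank.

```python
import math

train_rows = 143_485        # final training rows
effective_batch = 32        # effective batch size
train_runtime_s = 13_759    # Run B train runtime in seconds
lora_rank, lora_alpha = 32, 64


def hms_to_seconds(hms):
    """Convert an H:MM:SS string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


steps = math.ceil(train_rows / effective_batch)  # optimizer steps in one epoch
seconds_per_step = train_runtime_s / steps       # average step time
lora_scale = lora_alpha / lora_rank              # scaling applied to the LoRA update
wall_s = hms_to_seconds("3:52:19")               # Run B wall time
overhead_s = wall_s - train_runtime_s            # time spent outside the train loop

print(steps, round(seconds_per_step, 2), lora_scale, overhead_s)
```

Under these assumptions the run is roughly 4,484 steps at about 3 seconds per step, with about 3 minutes of wall time spent outside the training loop.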
Inference Notes
| Setting | Recommended value |
|---|---|
| enable_thinking | False |
| max_new_tokens | 32-48 |
| temperature | 0.7 |
| top_p | 0.9 |
| repetition_penalty | 1.05 |
The model is intentionally short, but it is not meant to be limited to one-token replies. If you want plain emote names instead of colon-wrapped emotes, a small display-side cleanup pass works well, for example :MONKA: to MONKA.
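One possible display-side cleanup pass is sketched below. The function name strip_emote_colons and the assumption that emote names consist of letters, digits, and underscores are illustrative choices; the word-boundary checks are there so that clock-style text such as 10:30:45 is left alone.

```python
import re

# :NAME: not glued to surrounding word characters; NAME is letters, digits, or
# underscores (an assumed naming rule, adjust for your emote set).
COLON_EMOTE = re.compile(r"(?<!\w):([A-Za-z0-9_]+):(?!\w)")


def strip_emote_colons(text):
    """Replace colon-wrapped emotes such as :MONKA: with the bare name."""
    return COLON_EMOTE.sub(r"\1", text)


print(strip_emote_colons("wait :MONKA: what :KEKW:"))
```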
Conversion Notes
The GGUF files were created by merging the PEFT adapter into openbmb/MiniCPM5-1B, converting the merged model with llama.cpp, then quantizing the recommended local build to Q4_K_M.
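The merge step can be sketched with PEFT's merge_and_unload. This is a minimal sketch rather than the exact script used for this repo: the function name merge_adapter, the float16 dtype, and the output directory are assumptions. The base model is loaded unquantized here, since a merged checkpoint intended for GGUF conversion needs full-precision weights.

```python
def merge_adapter(base_id, adapter_id, out_dir):
    """Merge a LoRA adapter into its base model and save the merged weights."""
    # Heavy imports are deferred so the module imports without the ML stack.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the base model in half precision (assumed dtype), not 4-bit.
    base = AutoModelForCausalLM.from_pretrained(
        base_id,
        torch_dtype=torch.float16,
        trust_remote_code=True,
    )
    # Attach the adapter, then fold its weights into the base layers.
    merged = PeftModel.from_pretrained(base, adapter_id).merge_and_unload()
    merged.save_pretrained(out_dir, safe_serialization=True)
    # Save the tokenizer alongside so the directory is self-contained.
    AutoTokenizer.from_pretrained(base_id, trust_remote_code=True).save_pretrained(out_dir)
    return out_dir
```

Calling merge_adapter("openbmb/MiniCPM5-1B", "<adapter repo id>", "merged") writes a standalone checkpoint that llama.cpp's conversion and quantization tools can then take as input.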