VoxCPM2 - ONNX
2 B-parameter multilingual TTS with voice cloning and voice design. 48 kHz output.
Part of the soniqo.audio speech toolkit, an open, runtime-portable stack for speech AI. This bundle is the ONNX Runtime export, designed to plug into the abstract interfaces in speech-core (OnnxVoxCPM2Tts). Browse all ONNX bundles in the soniqo ONNX collection.
Use cases on soniqo.audio
ONNX export of openbmb/VoxCPM2, a 2 B-parameter diffusion-autoregressive TTS with 48 kHz studio-quality output, reference-audio voice cloning, and natural-language voice design. It is a drop-in replacement for the LiteRT bundle on the synth worker side: same graph topology, same I/O contracts. It runs on the ONNX Runtime CPU EP today; the CUDA EP is already wired into the wrapper for a GPU swap.
Why ONNX
This export targets ONNX Runtime as a complement to the LiteRT bundle. Both use the same four-graph split; on a CPU-only workload ONNX Runtime gives us:
- ~28 % lower RSS during inference (8.2 GiB vs 11.5 GiB after load, 9.3 GiB vs 13.0 GiB at peak; measured on a Mac CPU, same prompt, same step count). On a memory-constrained synth pod, that can be the difference between fitting and not fitting.
- ~2.4× lower per-step latency (110 ms vs 266 ms per AR step on the same hardware). The XNNPACK INT8 path in ORT 1.26 is more aggressive about constant-folding the dequant.
- A clean path to GPU acceleration via the CUDA EP without re-exporting the bundle.
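The GPU path in the last bullet comes down to the provider list handed to each `InferenceSession`. A minimal sketch, assuming the standard ONNX Runtime provider names; `pick_providers` is a hypothetical helper for illustration and is not taken from speech-core:

```python
def pick_providers(available, prefer_gpu=True):
    """Return an ordered provider list: CUDA first when present, CPU as fallback."""
    providers = []
    if prefer_gpu and "CUDAExecutionProvider" in available:
        providers.append("CUDAExecutionProvider")
    # Keep the CPU EP last so the session still loads on hosts without CUDA.
    providers.append("CPUExecutionProvider")
    return providers


if __name__ == "__main__":
    try:
        import onnxruntime as ort
    except ImportError:
        print("onnxruntime is not installed; nothing to query")
    else:
        # get_available_providers() lists the EPs compiled into this ORT build.
        print(pick_providers(ort.get_available_providers()))
```

The returned list can be passed as `providers=` when creating the sessions, so the same bundle files serve both the CPU and GPU case.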
Pipeline
VoxCPM2 is not a single feed-forward model. The runtime loop is
text + optional instruction ──► text-prefill
  │
  ▼
repeated token-step (KV cache rolls per step)
  │
  ▼
audio-decoder ──► 48 kHz PCM
The host owns the loop and the KV cache; ONNX owns the static tensor programs. This is the same split as the LiteRT bundle in this collection: same host-side wrapper code, just a different runtime backend.
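The division of labour described above can be sketched as a host-side loop over three callables. This is only an outline under stated assumptions: the callable signatures, the `stop` flag, and the `max_steps` guard are hypothetical and do not reflect the real tensor names or stop criterion, which live in the OnnxVoxCPM2Tts wrapper.

```python
def synthesize(prefill, step, decode, token_ids, max_steps=1000):
    """Host-owned AR loop: prefill once, step repeatedly, decode once.

    prefill(token_ids)    -> (hidden, kv_cache)          # assumed signature
    step(hidden, kv_cache) -> (latent, kv_cache, stop)   # assumed signature
    decode(latents)       -> PCM samples                 # assumed signature
    """
    # 1. One prefill call turns the text (plus optional instruction)
    #    into an initial hidden state and the KV cache.
    hidden, kv_cache = prefill(token_ids)

    latents = []
    # 2. The host drives the loop and threads the cache through each call;
    #    the graph itself holds no state between steps.
    for _ in range(max_steps):
        latent, kv_cache, stop = step(hidden, kv_cache)
        latents.append(latent)
        hidden = latent
        if stop:
            break

    # 3. A single decoder call maps the collected latents to PCM.
    return decode(latents)
```

Because the graphs are passed in as plain callables, the same loop can wrap ONNX Runtime sessions or any other backend that honours the same contract.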
Files
| File | Size | Description |
|---|---|---|
| voxcpm2-text-prefill.onnx + .onnx.data | 4.2 GB | FP16-weight / FP32-compute text + instruction prefill (MiniCPM-4 KV-cache producer). max_text_tokens = 512. |
| voxcpm2-token-step.onnx + .onnx.data | 4.5 GB | FP16-weight / FP32-compute autoregressive step (MiniCPM-4 + residual LM, KV-cache in/out, CFM Euler decoder). |
| voxcpm2-text-prefill.int8.onnx + .int8.onnx.data | 2.6 GB | INT8 weight-only (MatMulNBits, block 32, FP32 accumulation) compact prefill. |
| voxcpm2-token-step.int8.onnx + .int8.onnx.data | 3.1 GB | INT8 weight-only (MatMulNBits, block 32, FP32 accumulation) compact step. |
| voxcpm2-audio-encoder.onnx | 183 MB | FP32 reference-audio encoder (16 kHz @ 6.4 s → 40 latent frames, voice-cloning only). |
| voxcpm2-audio-decoder.onnx | 175 MB | FP32 AudioVAE decoder (acoustic tokens → 48 kHz PCM, 10.24 s window). |
| tokenizer.json / tokenizer_config.json / special_tokens_map.json | - | HF tokenizer bundle. |
| generation_config.json / tokenization_voxcpm2.py | - | Generation defaults + tokenizer module. |
| config.json | - | Model config (architecture, dims, IO shapes per graph). |
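The window sizes quoted for the two audio graphs translate into sample counts as follows. The constants come from the table; `fit_reference` is a hypothetical helper, and its zero-padding behaviour assumes the encoder expects exactly one full 6.4 s window, which should be checked against config.json.

```python
from fractions import Fraction

ENC_RATE_HZ = 16_000
ENC_WINDOW_S = Fraction("6.4")
ENC_LATENT_FRAMES = 40
DEC_RATE_HZ = 48_000
DEC_WINDOW_S = Fraction("10.24")

# Exact rational arithmetic avoids float rounding in the sample counts.
enc_samples = int(ENC_RATE_HZ * ENC_WINDOW_S)           # 102400 samples of reference audio
samples_per_latent = enc_samples // ENC_LATENT_FRAMES   # 2560 samples (160 ms) per latent frame
dec_samples = int(DEC_RATE_HZ * DEC_WINDOW_S)           # 491520 PCM samples per decoder window


def fit_reference(samples, target=enc_samples):
    """Trim or zero-pad mono 16 kHz samples to the encoder window (assumed fixed-size)."""
    clipped = list(samples[:target])
    return clipped + [0.0] * (target - len(clipped))
```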
Precision formats. The default LM graphs store MatMul weights as FP16 and
compute in FP32 (one constant Cast per weight; ORT folds them at session
load). Output is numerically indistinguishable from the FP32 export
(cosine 1.000000 on every graph output) at half the download. The
.int8. variants quantize the same weights to INT8 via MatMulNBits
(block 32, symmetric, FP32 accumulation) for a further 30-40 % size cut
(per the sizes in the table above) with a small measured drift (prefill
hidden-state cosine 0.991 to 0.995 vs FP32; synthesized speech transcribes
identically in ASR round-trip checks).
Activations are never quantized in either format. AudioVAE graphs stay
FP32 (Conv-heavy; INT8 rejects Conv axis remapping, the same lesson as
Parakeet's decoder-joint).
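To make "block 32, symmetric" concrete, the sketch below quantizes a weight vector in blocks of 32 with one scale per block and measures the round-trip cosine. It illustrates the numeric scheme only: it does not reproduce MatMulNBits' packed storage layout, it runs on random stand-in weights, and it measures weight cosine rather than the hidden-state cosine quoted above.

```python
import math
import random

BLOCK = 32


def quantize_blockwise(weights, block=BLOCK):
    """Symmetric INT8 per block: q = round(w / scale), scale = max|w| / 127."""
    q, scales = [], []
    for start in range(0, len(weights), block):
        chunk = weights[start:start + block]
        # An all-zero block would give scale 0; fall back to 1.0 in that case.
        scale = max(abs(w) for w in chunk) / 127.0 or 1.0
        scales.append(scale)
        q.extend(max(-127, min(127, round(w / scale))) for w in chunk)
    return q, scales


def dequantize_blockwise(q, scales, block=BLOCK):
    """Recover approximate FP weights; each value uses its block's scale."""
    return [v * scales[i // block] for i, v in enumerate(q)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


rng = random.Random(0)
w = [rng.gauss(0.0, 1.0) for _ in range(4096)]  # stand-in weights, not from the model
q, s = quantize_blockwise(w)
w_hat = dequantize_blockwise(q, s)
print(f"weight cosine after INT8 round-trip: {cosine(w, w_hat):.6f}")
```

Small blocks keep each scale close to the local weight range, which is why weight-only INT8 tends to stay close to the FP32 reference.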
The .onnx.data files are external-data sidecars (the production
weights exceed the 2 GB protobuf serialization cap). When a model is
loaded from a file path, ORT's InferenceSession auto-resolves them from
the protobuf's external_data references with no special SessionOptions,
provided each sidecar sits next to its .onnx file.
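A missing sidecar otherwise only surfaces as an error at session load, so a quick pre-flight check can help. The sketch below assumes the naming pattern implied by the Files table (graph file name plus `.data`); `missing_sidecars` and `LM_GRAPHS` are hypothetical names introduced for illustration.

```python
from pathlib import Path

# Graphs that ship with an external-data sidecar, per the Files table.
LM_GRAPHS = ["voxcpm2-text-prefill.onnx", "voxcpm2-token-step.onnx"]


def missing_sidecars(bundle_dir, graphs=LM_GRAPHS):
    """List expected '<graph>.data' sidecars that are absent from bundle_dir."""
    root = Path(bundle_dir)
    missing = []
    for graph in graphs:
        sidecar = root / (graph + ".data")
        if not sidecar.is_file():
            missing.append(sidecar.name)
    return missing
```

For the INT8 variants, the same helper can be called with `["voxcpm2-text-prefill.int8.onnx", "voxcpm2-token-step.int8.onnx"]`.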
Quick start (Python)
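import onnxruntime as ort
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

# InferenceSession needs local file paths, not a Hub repo id, so download first.
# snapshot_download fetches every file in the repo, including the .onnx.data sidecars.
bundle = snapshot_download("soniqo/VoxCPM2-ONNX")
# trust_remote_code is required because the tokenizer ships as tokenization_voxcpm2.py.
tokenizer = AutoTokenizer.from_pretrained(bundle, trust_remote_code=True)

prefill = ort.InferenceSession(f"{bundle}/voxcpm2-text-prefill.onnx",
                               providers=["CPUExecutionProvider"])
step = ort.InferenceSession(f"{bundle}/voxcpm2-token-step.onnx",
                            providers=["CPUExecutionProvider"])
# The encoder is only needed for voice cloning from reference audio.
encoder = ort.InferenceSession(f"{bundle}/voxcpm2-audio-encoder.onnx",
                               providers=["CPUExecutionProvider"])
decoder = ort.InferenceSession(f"{bundle}/voxcpm2-audio-decoder.onnx",
                               providers=["CPUExecutionProvider"])
# ... see the speech-core OnnxVoxCPM2Tts wrapper for the full AR loop.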
For a complete reference implementation see OnnxVoxCPM2Tts in speech-core.
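Before writing a custom loop, it helps to read the I/O contract straight from the loaded graphs. The sketch below relies on the `get_inputs()` / `get_outputs()` methods of an ONNX Runtime session; `describe_io` is a hypothetical helper name.

```python
def describe_io(session):
    """Return the graph's inputs and outputs as (name, shape, type) tuples."""
    def rows(args):
        return [(a.name, a.shape, a.type) for a in args]

    return {
        "inputs": rows(session.get_inputs()),
        "outputs": rows(session.get_outputs()),
    }
```

Calling `describe_io(prefill)` or `describe_io(step)` on the sessions from the quick start shows which tensors carry the KV cache in and out of each call.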
License
Apache 2.0, inherited from upstream openbmb/VoxCPM2. Apache 2.0 covers both the weights and any exported derivative; verify against the upstream model card before commercial use.
Citation
@misc{openbmb-voxcpm2,
author = {OpenBMB},
title = {{VoxCPM2}: a 2B-parameter diffusion-autoregressive multilingual TTS},
year = {2025},
howpublished = {\url{https://huggingface.co/openbmb/VoxCPM2}}
}