VoxCPM2 - ONNX

2 B-parameter multilingual TTS with voice cloning and voice design. 48 kHz output.

Part of the soniqo.audio speech toolkit, an open, runtime-portable stack for speech AI. This bundle is the ONNX Runtime export, designed to plug into the abstract interfaces in speech-core (OnnxVoxCPM2Tts). Browse all ONNX bundles in the soniqo ONNX collection.

Use cases on soniqo.audio

ONNX export of openbmb/VoxCPM2, a 2 B-parameter diffusion-autoregressive TTS with 48 kHz studio-quality output, reference-audio voice cloning, and natural-language voice design. Drop-in replacement for the LiteRT bundle on the synth worker side: same graph topology, same I/O contracts. It runs on the ONNX Runtime CPU EP today; the CUDA EP is wired into the wrapper for a GPU swap.

Why ONNX

This export targets ONNX Runtime as a complement to the LiteRT bundle. Both use the same four-graph split; on a CPU-only workload ONNX Runtime gives us:

  • ~28 % lower RSS during inference (8.2 GiB vs 11.5 GiB after load, 9.3 GiB vs 13.0 GiB at peak; measured on a Mac CPU, same prompt, same step count). On a memory-constrained synth pod that can be the difference between fitting and not fitting.
  • ~2.4× lower per-step latency (110 ms vs 266 ms per AR step on the same hardware): the XNNPACK INT8 path in ORT 1.26 is more aggressive about constant-folding the dequant.
  • A clean path to GPU acceleration via the CUDA EP without re-exporting the bundle.
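The GPU swap mentioned in the last bullet comes down to the provider list passed to each session. The sketch below is one possible way to build that list; `pick_providers` is a hypothetical helper, not part of speech-core, and it only assumes the standard ONNX Runtime provider names.

```python
from typing import List, Sequence


def pick_providers(available: Sequence[str], prefer_gpu: bool = True) -> List[str]:
    """Return an ordered provider list for ort.InferenceSession(providers=...)."""
    chosen: List[str] = []
    if prefer_gpu and "CUDAExecutionProvider" in available:
        chosen.append("CUDAExecutionProvider")
    # Always keep the CPU EP last so unsupported nodes have a fallback.
    chosen.append("CPUExecutionProvider")
    return chosen


# In a real run, `available` would come from onnxruntime.get_available_providers().
cpu_only = pick_providers(["CPUExecutionProvider"])
with_gpu = pick_providers(["CUDAExecutionProvider", "CPUExecutionProvider"])
```

Because the same .onnx files are used either way, only this list changes between a CPU pod and a GPU pod.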

Pipeline

VoxCPM2 is not a single feed-forward model. The runtime loop is

text + optional instruction ──► text-prefill
                                      │
                                      ▼
                              repeated token-step  (KV cache rolls per step)
                                      │
                                      ▼
                              audio-decoder ──► 48 kHz PCM

The host owns the loop and the KV cache; ONNX owns the static tensor programs. This is the same split as the LiteRT bundle in this collection: same host-side wrapper code, just a different runtime backend.
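A minimal sketch of that host-owned loop follows. The three callables stand in for the text-prefill, token-step, and audio-decoder sessions. Their signatures, the `state` object, and the stop flag are assumptions made for illustration; the real tensor names and shapes are described in config.json and implemented in OnnxVoxCPM2Tts.

```python
from typing import Any, Callable, List, Sequence, Tuple


def run_loop(
    prefill: Callable[[Sequence[int]], Any],
    step: Callable[[Any], Tuple[Any, Any, bool]],
    decode: Callable[[List[Any]], Any],
    token_ids: Sequence[int],
    max_steps: int,
) -> Any:
    """Drive prefill -> repeated step -> decode; the host carries the state."""
    state = prefill(token_ids)             # builds the initial KV cache
    latents: List[Any] = []
    for _ in range(max_steps):
        latent, state, stop = step(state)  # KV cache goes in and comes back out
        latents.append(latent)
        if stop:                           # hypothetical end-of-speech signal
            break
    return decode(latents)                 # acoustic latents -> PCM


# Toy stand-ins so the control flow can be exercised without the weights.
def fake_prefill(ids):
    return {"n": 0}


def fake_step(state):
    n = state["n"]
    return n, {"n": n + 1}, n + 1 >= 3


result = run_loop(fake_prefill, fake_step, list, [1, 2, 3], max_steps=10)
```

The point of the shape is that none of the graphs loop internally: every iteration is one session call, and the host decides when to stop.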

Files

| File | Size | Description |
| --- | --- | --- |
| voxcpm2-text-prefill.onnx + .onnx.data | 4.2 GB | FP16-weight / FP32-compute text + instruction prefill (MiniCPM-4 KV-cache producer). max_text_tokens = 512. |
| voxcpm2-token-step.onnx + .onnx.data | 4.5 GB | FP16-weight / FP32-compute autoregressive step (MiniCPM-4 + residual LM, KV-cache in/out, CFM Euler decoder). |
| voxcpm2-text-prefill.int8.onnx + .int8.onnx.data | 2.6 GB | INT8 weight-only (MatMulNBits, block 32, FP32 accumulation) compact prefill. |
| voxcpm2-token-step.int8.onnx + .int8.onnx.data | 3.1 GB | INT8 weight-only (MatMulNBits, block 32, FP32 accumulation) compact step. |
| voxcpm2-audio-encoder.onnx | 183 MB | FP32 reference-audio encoder (16 kHz @ 6.4 s → 40 latent frames, voice-cloning only). |
| voxcpm2-audio-decoder.onnx | 175 MB | FP32 AudioVAE decoder (acoustic tokens → 48 kHz PCM, 10.24 s window). |
| tokenizer.json / tokenizer_config.json / special_tokens_map.json | - | HF tokenizer bundle. |
| generation_config.json / tokenization_voxcpm2.py | - | Generation defaults + tokenizer module. |
| config.json | - | Model config (architecture, dims, IO shapes per graph). |
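The fixed windows quoted for the audio graphs translate into sample counts as worked out below. The constants are taken from the table. `fit_to_window` is a hypothetical helper, and trimming or zero-padding the reference clip is an assumed policy, not something the bundle documents.

```python
from typing import List

ENCODER_RATE_HZ = 16_000     # reference-audio encoder input rate
ENCODER_WINDOW_S = 6.4       # encoder window length
ENCODER_FRAMES = 40          # latent frames produced per window
DECODER_RATE_HZ = 48_000     # decoder output rate
DECODER_WINDOW_S = 10.24     # decoder window length

encoder_samples = round(ENCODER_RATE_HZ * ENCODER_WINDOW_S)   # 102400
samples_per_latent = encoder_samples // ENCODER_FRAMES        # 2560
decoder_samples = round(DECODER_RATE_HZ * DECODER_WINDOW_S)   # 491520


def fit_to_window(samples: List[float], length: int) -> List[float]:
    """Trim or zero-pad mono samples to exactly `length` (assumed policy)."""
    return list(samples[:length]) + [0.0] * max(0, length - len(samples))
```

So a reference clip for voice cloning would be brought to 102,400 samples at 16 kHz, and each decoder window yields 491,520 samples at 48 kHz.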

Precision formats. The default LM graphs store MatMul weights as FP16 and compute in FP32 (one constant Cast per weight; ORT folds them at session load). Output is numerically indistinguishable from the FP32 export (cosine 1.000000 on every graph output) at half the download. The .int8. variants quantize the same weights to INT8 via MatMulNBits (block 32, symmetric, FP32 accumulation) for a further ~40 % size cut with a small measured drift (prefill hidden-state cosine 0.991–0.995 vs FP32; synthesized speech transcribes identically in ASR round-trip checks). Activations are never quantized in either format. AudioVAE graphs stay FP32: they are Conv-heavy, and INT8 rejects Conv axis remapping (the same lesson as Parakeet's decoder-joint).
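To make "block 32, symmetric" and the cosine metric concrete, here is a small pure-Python illustration of block-wise symmetric INT8 quantization. It is only a sketch of the arithmetic: the real MatMulNBits operator uses its own packed layout, and Python floats are FP64 rather than FP32. All function names are hypothetical.

```python
import math
from typing import List, Tuple


def quantize_blockwise(weights: List[float], block: int = 32) -> Tuple[List[int], List[float]]:
    """Symmetric INT8: one scale per block, zero point fixed at 0."""
    q: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), block):
        chunk = weights[start:start + block]
        peak = max(abs(w) for w in chunk)
        scale = peak / 127.0 if peak > 0 else 1.0
        scales.append(scale)
        q.extend(max(-127, min(127, round(w / scale))) for w in chunk)
    return q, scales


def dequantize_blockwise(q: List[int], scales: List[float], block: int = 32) -> List[float]:
    """Map each integer back to a float using its block's scale."""
    return [v * scales[i // block] for i, v in enumerate(q)]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Synthetic weights, two blocks of 32; not taken from the model.
weights = [math.sin(0.37 * i) * (1 + i % 5) for i in range(64)]
q, scales = quantize_blockwise(weights)
restored = dequantize_blockwise(q, scales)
similarity = cosine(weights, restored)
```

Per-block scales keep the rounding error of each weight within half a scale step of its own block, which is why small blocks limit drift when weight magnitudes vary across a matrix.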

The .onnx.data files are external-data sidecars (the production weights exceed the 2 GB protobuf serialization cap). ORT's InferenceSession auto-resolves them from the protobuf's external_data references with no special SessionOptions.
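Since each LM graph needs its sidecar next to it, it can help to list exactly which files a given precision requires. The helper below is hypothetical, and it assumes the table's "+ .onnx.data" shorthand means the sidecar name is the graph file name with ".data" appended.

```python
from typing import List


def bundle_files(precision: str = "fp16", voice_cloning: bool = False) -> List[str]:
    """List the model files needed for one precision variant."""
    suffix = {"fp16": ".onnx", "int8": ".int8.onnx"}[precision]
    files: List[str] = []
    for graph in ("voxcpm2-text-prefill", "voxcpm2-token-step"):
        # Graph file plus its external-data sidecar (assumed naming).
        files += [graph + suffix, graph + suffix + ".data"]
    files.append("voxcpm2-audio-decoder.onnx")       # FP32 in both variants
    if voice_cloning:
        files.append("voxcpm2-audio-encoder.onnx")   # only needed for cloning
    return files


int8_files = bundle_files("int8")
```

A list like this could be passed as `allow_patterns` to `huggingface_hub.snapshot_download`, together with the tokenizer and config files, to avoid downloading both precision variants.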

Quick start (Python)

import onnxruntime as ort
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

# InferenceSession needs local file paths, not a Hub repo id, so download first.
# Note: this fetches the whole repo, including the .int8. variants.
bundle = snapshot_download("soniqo/VoxCPM2-ONNX")
providers = ["CPUExecutionProvider"]

tokenizer = AutoTokenizer.from_pretrained(bundle, trust_remote_code=True)
prefill = ort.InferenceSession(f"{bundle}/voxcpm2-text-prefill.onnx", providers=providers)
step    = ort.InferenceSession(f"{bundle}/voxcpm2-token-step.onnx", providers=providers)
# The encoder is only needed for voice cloning.
encoder = ort.InferenceSession(f"{bundle}/voxcpm2-audio-encoder.onnx", providers=providers)
decoder = ort.InferenceSession(f"{bundle}/voxcpm2-audio-decoder.onnx", providers=providers)

# ... see the speech-core OnnxVoxCPM2Tts wrapper for the full AR loop.

For a complete reference implementation see OnnxVoxCPM2Tts in speech-core.
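This card does not list the tensor names of each graph, so before writing a loop by hand it may be useful to inspect them. The sketch below relies only on the `get_inputs()` / `get_outputs()` methods of an ONNX Runtime session. `describe_io` is a hypothetical helper, and the fake session with its tensor names exists only so the example runs without the weights.

```python
from types import SimpleNamespace


def describe_io(session) -> dict:
    """Collect (name, shape, type) for every input and output of a session."""
    def rows(args):
        return [(a.name, a.shape, a.type) for a in args]
    return {"inputs": rows(session.get_inputs()), "outputs": rows(session.get_outputs())}


# Stand-in object; the names and shapes here are invented for the demo
# and are NOT the bundle's real I/O contract.
fake_session = SimpleNamespace(
    get_inputs=lambda: [SimpleNamespace(name="demo_input", shape=[1, 512], type="tensor(int64)")],
    get_outputs=lambda: [SimpleNamespace(name="demo_output", shape=[1, 512, 8], type="tensor(float)")],
)
summary = describe_io(fake_session)
```

Calling `describe_io(prefill)` or `describe_io(step)` on the real sessions would show the KV-cache inputs and outputs that the host has to carry between steps.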

License

Apache 2.0, inherited from upstream openbmb/VoxCPM2. Apache 2.0 covers both the weights and any exported derivative; verify against the upstream model card before commercial use.

Citation

@misc{openbmb-voxcpm2,
  author = {OpenBMB},
  title  = {{VoxCPM2}: a 2B-parameter diffusion-autoregressive multilingual TTS},
  year   = {2025},
  howpublished = {\url{https://huggingface.co/openbmb/VoxCPM2}}
}