CosyVoice3-0.5B MLX bf16

CosyVoice 3 text-to-speech model converted to MLX safetensors format with unquantized bf16 weights for Apple Silicon inference. Includes the S3-Tokenizer-v3 reference-audio encoder needed for zero-shot voice cloning.

Converted from FunAudioLLM/Fun-CosyVoice3-0.5B-2512.

Swift inference: speech-swift

Variants

| Variant | LLM | DiT | Total | Use case |
| --- | --- | --- | --- | --- |
| This bundle (bf16) | bf16 | bf16 | ~2.1 GB | Reference quality: no quantization noise anywhere |
| 8-bit-full | int8 (group_size=64) | int8 (group_size=64) | ~1.6 GB | Best quality/size trade-off |
| 8-bit | int8 (group_size=64) | int4 | ~1.4 GB | Cleaner LLM logits, light DiT |
| 4-bit | int4 (group_size=64) | int4 | ~1.2 GB | Smallest download / disk footprint |

All bundles include the speech tokenizer and support zero-shot voice cloning. Choose bf16 when LLM or DiT quantisation noise is a problem for your use case (long-form synthesis, low-resource languages, voice-cloning fidelity) and neither disk space nor RAM is a constraint.
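As a rough illustration of that choice, the sketch below picks the largest bundle that fits a download budget. The type and function names are hypothetical and not part of speech-swift; the sizes are the approximate totals from the Variants table, and treating "larger" as "higher fidelity" is an assumption based on the table's use-case column.

```swift
// Hypothetical helper: none of these names come from speech-swift.
struct BundleVariant {
    let name: String
    let approxSizeGB: Double   // approximate totals from the Variants table
}

// Ordered from largest (least quantised) to smallest footprint.
let variants = [
    BundleVariant(name: "bf16", approxSizeGB: 2.1),
    BundleVariant(name: "8-bit-full", approxSizeGB: 1.6),
    BundleVariant(name: "8-bit", approxSizeGB: 1.4),
    BundleVariant(name: "4-bit", approxSizeGB: 1.2),
]

/// Returns the first (largest) variant that fits the budget,
/// or nil when even the 4-bit bundle is too large.
func pickVariant(maxSizeGB: Double) -> BundleVariant? {
    return variants.first(where: { $0.approxSizeGB <= maxSizeGB })
}
```

For example, `pickVariant(maxSizeGB: 1.5)` would select the 8-bit bundle under these assumptions.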

Model Details

| Component | Architecture | Size |
| --- | --- | --- |
| LLM | Qwen2.5-0.5B (24L, 896d, 14Q/2KV heads) | 965 MB (bf16) |
| DiT Flow Matching | 22-layer DiT (1024d, 16 heads, 10 ODE steps) | 634 MB (bf16) |
| HiFi-GAN Vocoder | NSF + F0 predictor + ISTFT | 79 MB (fp32) |
| S3-Tokenizer-v3 | 12-layer Conformer + FSMN + FSQ (242M params) | 462 MB (bf16) |
| Total | | ~2.1 GB |

Pipeline

Text          ─┐
                ├─► LLM (Qwen2.5-0.5B bf16)  ─► Speech tokens (FSQ 6561)
Ref transcript ┘                                           │
                                                           ▼
                              ┌─► prompt_token ─┐
Reference WAV ─► S3-Tokenizer-v3                ├─► DiT Flow Matching ─► Mel
              ─► Matcha mel    ─► prompt_feat ─┘    (cond + spk_emb, bf16)    │
              ─► CAM++         ─► flow_embedding                              ▼
                                                                          HiFi-GAN
                                                                              │
                                                                              ▼
                                                                         Audio (24 kHz)
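The 6561 in the diagram is the speech-token vocabulary size, and 6561 = 3^8. One reading, assumed here rather than stated in this card, is a finite scalar quantiser with 8 dimensions of 3 levels each. Under that assumption, a quantised vector maps to a token id by reading it as a base-3 number; the function name and digit order below are illustrative only.

```swift
// Assumption: 8 FSQ dimensions with 3 levels each (values -1, 0, 1),
// which gives 3^8 = 6561 distinct codes.
let fsqDimensions = 8
let fsqLevels = 3

/// Maps one quantised FSQ vector (each entry in -1...1) to a token id
/// by reading it as a base-3 number, least significant digit first.
func fsqTokenID(_ quantised: [Int]) -> Int {
    precondition(quantised.count == fsqDimensions, "expected 8 dimensions")
    var id = 0
    var placeValue = 1
    for value in quantised {
        precondition((-1...1).contains(value), "FSQ values must be -1, 0 or 1")
        id += (value + 1) * placeValue   // shift -1...1 to digit 0...2
        placeValue *= fsqLevels
    }
    return id
}
```

With this encoding the ids span 0 through 6560, matching a vocabulary of 6561 entries.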

Languages

Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian.

Files

  • llm.safetensors — LLM weights (bf16, unquantised)
  • flow.safetensors — DiT flow matching decoder (bf16, unquantised)
  • hifigan.safetensors — HiFi-GAN vocoder (fp32, weight-norm folded)
  • speech_tokenizer.safetensors — S3-Tokenizer-v3 reference encoder (bf16)
  • config.json — Model configuration (tokenizer + frame rates)
  • vocab.json / merges.txt / tokenizer_config.json — Qwen2.5 BPE tokenizer
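
Before loading a locally downloaded bundle, it can help to confirm that every file in the list above is present. The following is a minimal sketch using only Foundation; `missingBundleFiles` is a hypothetical helper, not a speech-swift API.

```swift
import Foundation

// File names taken from the Files list above.
let requiredFiles = [
    "llm.safetensors",
    "flow.safetensors",
    "hifigan.safetensors",
    "speech_tokenizer.safetensors",
    "config.json",
    "vocab.json",
    "merges.txt",
    "tokenizer_config.json",
]

/// Returns the required files that are absent from `directory`.
/// Hypothetical helper, not part of speech-swift.
func missingBundleFiles(in directory: URL) -> [String] {
    return requiredFiles.filter { name in
        let path = directory.appendingPathComponent(name).path
        return !FileManager.default.fileExists(atPath: path)
    }
}
```

An empty result means the directory contains the full bundle; it does not validate the contents of the files.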

Conversion Details

  • LLM: bf16 throughout (no group quantisation applied)
  • Flow / DiT: bf16 throughout (no group quantisation applied)
  • HiFi-GAN: fp32 with weight normalization folded (w = g * v / ||v||)
  • Speech tokenizer: bf16 (runs once per voice profile, accuracy outweighs disk size)
  • Conv1d weights transposed from PyTorch [out, in, kernel] to MLX [out, kernel, in]
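
The last two conversion steps can be illustrated on plain nested arrays. This is a sketch of the arithmetic only, not the converter used for this bundle: the function names are hypothetical, and taking the norm per output channel over the (in, kernel) axes is an assumption that follows PyTorch's default weight-norm behaviour.

```swift
// Weights as nested arrays indexed [out][in][kernel] (PyTorch layout).
typealias Tensor3 = [[[Float]]]

/// Folds weight normalisation: w = g * v / ||v||, with the norm taken
/// per output channel over the (in, kernel) axes (assumed convention).
func foldWeightNorm(v: Tensor3, g: [Float]) -> Tensor3 {
    precondition(v.count == g.count, "one gain per output channel")
    var folded = v
    for o in v.indices {
        var sumOfSquares: Float = 0
        for row in v[o] { for x in row { sumOfSquares += x * x } }
        let scale = g[o] / sumOfSquares.squareRoot()
        for i in v[o].indices {
            for k in v[o][i].indices { folded[o][i][k] = v[o][i][k] * scale }
        }
    }
    return folded
}

/// Reorders [out][in][kernel] (PyTorch) to [out][kernel][in] (MLX).
func toMLXLayout(_ w: Tensor3) -> Tensor3 {
    var result: Tensor3 = []
    for channel in w {
        let kernelSize = channel.first?.count ?? 0
        var transposed: [[Float]] = []
        for k in 0..<kernelSize {
            transposed.append(channel.map { $0[k] })
        }
        result.append(transposed)
    }
    return result
}
```

Folding once at conversion time means inference only ever sees the plain weight `w`, so the separate `g` and `v` tensors do not need to be stored.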

Zero-Shot Voice Cloning

For best clone quality, the LLM needs both the reference's acoustic prefix and its text transcript. Upstream's inference_zero_shot feeds the LLM concat(prompt_text, content_text) plus the reference's FSQ codes as an autoregressive prefix; this bundle ships everything needed for that path.

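import Foundation
import CosyVoiceTTS

// Placeholder path: point this at your own reference recording.
let refURL = URL(fileURLWithPath: "reference.wav")

let model = try await CosyVoiceTTSModel.fromPretrained(
    modelId: "aufklarer/CosyVoice3-0.5B-MLX-bf16"
)

// referenceTranscript should be the text spoken in the reference recording.
let result = try await model.synthesize(
    text: "你好,欢迎来到 CosyVoice 三。",
    referenceWAV: refURL,
    referenceTranscript: "床前明月光,疑是地上霜。"
)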
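
To make the prefix ordering described above concrete, the sketch below assembles it from token ids. It is a simplification with hypothetical types: the real model consumes embeddings and inserts special tokens that are omitted here, and nothing in this block is a speech-swift API.

```swift
// Simplified sketch with hypothetical types: real inputs are embeddings and
// include special tokens that are omitted here.
enum LLMToken: Equatable {
    case text(Int)     // Qwen2.5 BPE id
    case speech(Int)   // FSQ code from the reference audio
}

/// Builds the prefix the LLM continues from: the reference transcript,
/// then the text to speak, then the reference's speech codes.
func zeroShotPrefix(promptText: [Int], contentText: [Int], promptSpeech: [Int]) -> [LLMToken] {
    let text = (promptText + contentText).map { LLMToken.text($0) }
    let speech = promptSpeech.map { LLMToken.speech($0) }
    return text + speech
}
```

Because the reference's speech codes end the prefix, the model continues generating speech tokens for the remaining text (the content) in the same voice, which is why a transcript that matches the reference audio matters.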

Source

Upstream: FunAudioLLM/Fun-CosyVoice3-0.5B-2512
Paper: CosyVoice 3 (arXiv:2505.17589)

License

Apache 2.0 (inherited from upstream).
