CosyVoice3-0.5B MLX bf16

CosyVoice 3 text-to-speech model converted to MLX safetensors format with unquantized bf16 weights for Apple Silicon inference. Includes the S3-Tokenizer-v3 reference-audio encoder needed for zero-shot voice cloning.

Converted from FunAudioLLM/Fun-CosyVoice3-0.5B-2512.

Swift inference: speech-swift

Variants

Variant	LLM	DiT	Total	Use case
This bundle (bf16)	bf16	bf16	~2.1 GB	Reference quality — no quantization noise anywhere
8-bit-full	int8 (group_size=64)	int8 (group_size=64)	~1.6 GB	Best quality/size trade-off
8-bit	int8 (group_size=64)	int4	~1.4 GB	Cleaner LLM logits, light DiT
4-bit	int4 (group_size=64)	int4	~1.2 GB	Smallest download / disk footprint

All bundles include the speech tokenizer and support zero-shot voice cloning. Choose bf16 when LLM/DiT quantization noise is a problem (long-form synthesis, low-resource languages, voice-cloning fidelity) and disk space and RAM are not a concern.
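
The trade-off above can be expressed as a small selection helper. This is a minimal sketch that uses only the variant names and approximate sizes from the table; `Variant` and `pickVariant` are hypothetical names for illustration and are not part of speech-swift.

```swift
// Bundle variants and approximate total sizes, copied from the table above.
struct Variant {
    let name: String
    let approxSizeGB: Double

    static let all = [
        Variant(name: "bf16", approxSizeGB: 2.1),
        Variant(name: "8-bit-full", approxSizeGB: 1.6),
        Variant(name: "8-bit", approxSizeGB: 1.4),
        Variant(name: "4-bit", approxSizeGB: 1.2),
    ]
}

/// Returns the largest (least quantized) variant that fits the budget,
/// or nil when even the 4-bit bundle is too large.
/// Hypothetical helper: it only compares the sizes listed in the table.
func pickVariant(diskBudgetGB: Double) -> Variant? {
    Variant.all
        .filter { $0.approxSizeGB <= diskBudgetGB }
        .max { $0.approxSizeGB < $1.approxSizeGB }
}

if let choice = pickVariant(diskBudgetGB: 1.5) {
    print("Use the \(choice.name) bundle (~\(choice.approxSizeGB) GB)")
}
```

Size is used here as a stand-in for quality, which matches the ordering in the table but ignores runtime memory beyond the weights themselves.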

Model Details

Component	Architecture	Size
LLM	Qwen2.5-0.5B (24L, 896d, 14Q/2KV heads)	965 MB (bf16)
DiT Flow Matching	22-layer DiT (1024d, 16 heads, 10 ODE steps)	634 MB (bf16)
HiFi-GAN Vocoder	NSF + F0 predictor + ISTFT	79 MB (fp32)
S3-Tokenizer-v3	12-layer Conformer + FSMN + FSQ (242M params)	462 MB (bf16)
Total		~2.1 GB
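
The figures in the table can be cross-checked with a few lines of arithmetic. The sketch below sums the component sizes and derives the attention layout of the LLM, assuming the usual convention that the per-head width is the model width divided by the number of query heads; all variable names are illustrative.

```swift
// Component sizes in MB, copied from the table above.
let componentSizesMB: [(name: String, sizeMB: Int)] = [
    (name: "LLM", sizeMB: 965),
    (name: "DiT Flow Matching", sizeMB: 634),
    (name: "HiFi-GAN Vocoder", sizeMB: 79),
    (name: "S3-Tokenizer-v3", sizeMB: 462),
]
let totalMB = componentSizesMB.reduce(0) { $0 + $1.sizeMB }  // 2140 MB, about 2.1 GB
print("Total: \(totalMB) MB")

// LLM attention shape from the table: 896-d model, 14 query heads, 2 KV heads.
let hiddenSize = 896
let queryHeads = 14
let kvHeads = 2
let headDim = hiddenSize / queryHeads        // 64 dimensions per head (assumed convention)
let queriesPerKVHead = queryHeads / kvHeads  // 7 query heads share each KV head
print("headDim = \(headDim), query heads per KV head = \(queriesPerKVHead)")
```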

Pipeline

Text          ─┐
                ├─► LLM (Qwen2.5-0.5B bf16)  ─► Speech tokens (FSQ 6561)
Ref transcript ┘                                           │
                                                           ▼
                              ┌─► prompt_token ─┐
Reference WAV ─► S3-Tokenizer-v3                ├─► DiT Flow Matching ─► Mel
              ─► Matcha mel    ─► prompt_feat ─┘    (cond + spk_emb, bf16)    │
              ─► CAM++         ─► flow_embedding                              ▼
                                                                          HiFi-GAN
                                                                              │
                                                                              ▼
                                                                         Audio (24 kHz)
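
The DiT stage in the diagram produces the mel spectrogram by integrating a learned velocity field over a fixed number of ODE steps (10, per the Model Details table). The following is a minimal sketch of that idea using fixed-step Euler integration with uniform time steps. Both the choice of Euler and the uniform schedule are assumptions here, and the real model may add guidance or a different schedule. `eulerSample` and the toy velocity closure are illustrative names, not speech-swift API.

```swift
/// Integrates dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed-step Euler.
/// In the real pipeline the velocity would come from the DiT; here it is a
/// caller-supplied closure so the sketch stays self-contained.
func eulerSample(
    start: [Double],
    steps: Int = 10,
    velocity: ([Double], Double) -> [Double]
) -> [Double] {
    var x = start
    let dt = 1.0 / Double(steps)
    for step in 0..<steps {
        let t = Double(step) * dt
        let v = velocity(x, t)
        for i in x.indices {
            x[i] += dt * v[i]
        }
    }
    return x
}

// Toy example: a constant velocity field pointing from the noise sample to a
// target vector. Euler integrates a constant field exactly, so the result
// lands on the target up to floating-point rounding.
let noise: [Double] = [0.3, -1.2, 0.8]
let target: [Double] = [1.0, 0.0, -1.0]
let mel = eulerSample(start: noise) { _, _ in
    zip(target, noise).map { $0 - $1 }
}
print(mel)
```

Fewer steps trade accuracy for speed when the velocity field is not constant, which is why the step count appears as a model parameter.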

Languages

Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian.

Files

llm.safetensors: LLM weights (bf16, unquantized)
flow.safetensors: DiT flow matching decoder (bf16, unquantized)
hifigan.safetensors: HiFi-GAN vocoder (fp32, weight-norm folded)
speech_tokenizer.safetensors: S3-Tokenizer-v3 reference encoder (bf16)
config.json: Model configuration (tokenizer + frame rates)
vocab.json / merges.txt / tokenizer_config.json: Qwen2.5 BPE tokenizer

Conversion Details

LLM: bf16 throughout (no group quantization applied)
Flow / DiT: bf16 throughout (no group quantization applied)
HiFi-GAN: fp32 with weight normalization folded (w = g * v / ||v||)
Speech tokenizer: bf16 (runs once per voice profile, so accuracy outweighs disk size)
Conv1d weights transposed from PyTorch [out, in, kernel] to MLX [out, kernel, in]
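
The last two conversion steps can be sketched in plain Swift on flat arrays. This is an illustration under stated assumptions, not the actual conversion script: weights are stored row-major, and the weight-norm magnitude g is one scalar per output channel (PyTorch's default for weight_norm). `foldWeightNorm` and `transposeConv1dWeight` are hypothetical helper names.

```swift
/// Folds weight normalization for one output channel: w = g * v / ||v||.
/// `v` is that channel's flattened [in, kernel] slice.
func foldWeightNorm(g: Float, v: [Float]) -> [Float] {
    let sumOfSquares = v.reduce(Float(0)) { $0 + $1 * $1 }
    let norm = sumOfSquares.squareRoot()
    return v.map { g * $0 / norm }
}

/// Reorders a flat row-major Conv1d weight from PyTorch layout
/// [out, in, kernel] to MLX layout [out, kernel, in].
func transposeConv1dWeight(
    _ w: [Float], outChannels: Int, inChannels: Int, kernel: Int
) -> [Float] {
    precondition(w.count == outChannels * inChannels * kernel, "shape mismatch")
    var result = [Float](repeating: 0, count: w.count)
    for o in 0..<outChannels {
        for i in 0..<inChannels {
            for k in 0..<kernel {
                let src = (o * inChannels + i) * kernel + k  // index in [out, in, kernel]
                let dst = (o * kernel + k) * inChannels + i  // index in [out, kernel, in]
                result[dst] = w[src]
            }
        }
    }
    return result
}

// One output channel with v = [3, 4] and g = 10: ||v|| = 5, so w = [6, 8].
let folded = foldWeightNorm(g: 10, v: [3, 4])

// One output channel, 2 input channels, kernel size 3.
let transposed = transposeConv1dWeight(
    [0, 1, 2, 3, 4, 5], outChannels: 1, inChannels: 2, kernel: 3
)
print(folded, transposed)
```

Folding removes the separate g and v tensors at inference time, so the vocoder loads a single weight per layer.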

Zero-Shot Voice Cloning

For best clone quality, the LLM needs both the reference's acoustic prefix and its text transcript. Upstream's inference_zero_shot feeds the LLM concat(prompt_text, content_text) plus the reference's FSQ codes as an autoregressive prefix; this bundle ships everything needed for that path.

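import Foundation
import CosyVoiceTTS

// Load the bf16 bundle by its Hugging Face model ID.
let model = try await CosyVoiceTTSModel.fromPretrained(
    modelId: "aufklarer/CosyVoice3-0.5B-MLX-bf16"
)

// Reference clip of the voice to clone. The path is a placeholder, and
// referenceWAV is assumed to take a file URL.
let refURL = URL(fileURLWithPath: "reference.wav")

// text: "Hello, welcome to CosyVoice 3."
// referenceTranscript: the words spoken in the reference clip.
let result = try await model.synthesize(
    text: "你好，欢迎来到 CosyVoice 三。",
    referenceWAV: refURL,
    referenceTranscript: "床前明月光，疑是地上霜。"
)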
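
To make the zero-shot input layout concrete, the sketch below assembles the two sequences described above from toy token IDs. It is a simplified illustration: special tokens, embeddings, and the actual tokenizers are omitted, the IDs are made up, and `ZeroShotPrompt` and `makeZeroShotPrompt` are hypothetical names rather than speech-swift API. The codebook size of 6561 is taken from the pipeline diagram.

```swift
/// Simplified view of what the LLM consumes in zero-shot mode.
struct ZeroShotPrompt {
    let textTokens: [Int]    // concat(prompt_text, content_text)
    let speechPrefix: [Int]  // reference FSQ codes the LLM continues from
}

func makeZeroShotPrompt(
    promptText: [Int],
    contentText: [Int],
    referenceCodes: [Int],
    codebookSize: Int = 6561
) -> ZeroShotPrompt {
    // Every reference code must be a valid index into the FSQ codebook.
    precondition(
        referenceCodes.allSatisfy { (0..<codebookSize).contains($0) },
        "reference codes must be valid FSQ indices"
    )
    return ZeroShotPrompt(
        textTokens: promptText + contentText,
        speechPrefix: referenceCodes
    )
}

// Toy IDs standing in for tokenized transcript, target text, and FSQ codes.
let prompt = makeZeroShotPrompt(
    promptText: [1, 2],
    contentText: [3],
    referenceCodes: [10, 6560]
)
print(prompt.textTokens, prompt.speechPrefix)
```

Because the speech prefix corresponds to the prompt text, the LLM continues generating speech tokens for the content text in the same voice.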

Source

Upstream: FunAudioLLM/Fun-CosyVoice3-0.5B-2512
Paper: CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training (arXiv:2505.17589)

License

Apache 2.0 (inherited from upstream).
