CosyVoice 3 — Norwegian LoRA fine-tune

LoRA fine-tune of Fun-CosyVoice3-0.5B-2512 on Norwegian Bokmål speech. Trained as part of an internal self-hosted TTS stack.

Release: step 31,680 · Published: 2026-06-04

What's in this release

  • model_31680_ema.pt (2,025 MB) — EMA-merged Qwen2 LLM weights, drop-in CosyVoice3 inference checkpoint.
  • model_31680_lora_state.pt (282 MB) — LoRA adapters + AdamW state + EMA shadow + step counter, for resuming fine-tunes.

The EMA-merged file is what you load for inference. The LoRA-state sidecar is for anyone who wants to continue training from this checkpoint.
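The two file sizes can be cross-checked against the LoRA shape listed under Training setup (r=32 on seven projection layers in each of 24 blocks). The sketch below is a back-of-the-envelope estimate, not the publishing script. It assumes the public Qwen2-0.5B projection dimensions (hidden size 896, 128-dimensional key/value projections, MLP width 4864), fp32 storage, and a sidecar that holds four tensors per LoRA parameter (adapter, two AdamW moments, EMA shadow). None of these assumptions are stated in this card.

```python
# Assumed Qwen2-0.5B dimensions (from the public model config, not from this card).
HIDDEN = 896    # model width
KV_DIM = 128    # 2 key/value heads x 64 dims (grouped-query attention)
FFN = 4864      # MLP intermediate width
NUM_BLOCKS = 24
RANK = 32
BYTES_FP32 = 4

# (in_features, out_features) of each LoRA-wrapped linear layer in one block.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, FFN),
    "up_proj": (HIDDEN, FFN),
    "down_proj": (FFN, HIDDEN),
}


def lora_param_count(rank, shapes, num_blocks):
    """A LoRA pair on a (fan_in, fan_out) layer adds rank * (fan_in + fan_out) weights."""
    per_block = sum(rank * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_block * num_blocks


lora_params = lora_param_count(RANK, MODULE_SHAPES, NUM_BLOCKS)

# Assumed sidecar layout: adapter + AdamW exp_avg + AdamW exp_avg_sq + EMA shadow.
COPIES_IN_SIDECAR = 4
sidecar_mb = lora_params * COPIES_IN_SIDECAR * BYTES_FP32 / 1e6

# The merged checkpoint size implies the fp32 parameter count of the LLM.
merged_params_m = 2025e6 / BYTES_FP32 / 1e6

print(f"LoRA params: {lora_params:,}")
print(f"Estimated sidecar size: {sidecar_mb:.1f} MB")
print(f"Parameters implied by the merged file: ~{merged_params_m:.0f}M")
```

Under these assumptions the estimate comes to about 17.6M LoRA parameters and roughly 281.5 MB for the sidecar, close to the listed 282 MB; the remainder would be serialization overhead and scalar state such as the step counter.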

Training setup

  • Stage: stage_4_llm_lora — LoRA on the Qwen2-0.5B LLM frontend only. Flow-matching decoder + HiFi-GAN inherit the base CosyVoice3 weights unchanged.

  • LoRA shape: r=32, α=11.3 with rank-stabilized scaling (α/√r ≈ 2.0), applied to all 24 Qwen2 transformer blocks on 7 modules each (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj). ~17.6M trainable params over ~500M base.

  • Optimizer: AdamW, peak LR 1.0e-4, warmup 1,000 steps, Noam decay.

  • Batch: 7,200 frames/GPU × grad_accum=8 (effective 57,600 frames).

  • Max grad norm: 1.

  • EMA: maintained over the LoRA delta itself (power-law warmup, β→0.9999, update_every=10). Inference checkpoint is EMA-merged.

  • Hardware: single NVIDIA RTX 3090 (24 GB).

  • Instruct prefix: a language directive (e.g. "Speak in Norwegian.") plus an optional style hint, terminated by <|endofprompt|>. During training the prefix is sampled independently for each utterance at tokenization time, with random masking of the style hint or of the full prefix (CV3 §2.6.2). At inference the caller supplies it; CV3's LLM asserts that the <|endofprompt|> token (id 151646) is present.

  • Latest training loss: 0.4028 at step 31,680.

  • Total training time: ~57.1 h.
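Several of the numbers above follow from simple formulas. The sketch below evaluates them under stated assumptions: the Noam schedule is taken to be the usual linear warmup followed by inverse-square-root decay, and the EMA warmup is taken to be the common power-law form 1 − (1 + n/inv_gamma)^(−power), capped at β. The function names and the inv_gamma and power values are illustrative and do not come from the training code.

```python
import math


def rslora_scale(alpha, rank):
    """Rank-stabilized LoRA multiplies the adapter output by alpha / sqrt(rank)."""
    return alpha / math.sqrt(rank)


def noam_lr(step, peak_lr=1.0e-4, warmup=1000):
    """Linear warmup to peak_lr, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)
    return peak_lr * min(step / warmup, math.sqrt(warmup / step))


def ema_decay(step, beta_max=0.9999, update_every=10, inv_gamma=1.0, power=2.0 / 3.0):
    """Power-law EMA warmup capped at beta_max.

    inv_gamma and power are assumed values, not taken from the card.
    """
    n_updates = step // update_every  # the EMA only advances every `update_every` steps
    return min(beta_max, 1.0 - (1.0 + n_updates / inv_gamma) ** -power)


# Frames per GPU times gradient-accumulation steps (single GPU).
effective_frames = 7200 * 8

print(f"rsLoRA scale: {rslora_scale(11.3, 32):.3f}")
print(f"effective batch: {effective_frames} frames")
for s in (500, 1000, 31680):
    print(f"step {s}: lr={noam_lr(s):.3e}  ema_decay={ema_decay(s):.4f}")
```

With α=11.3 and r=32 the scale comes out at about 2.0, and this form of the schedule gives a learning rate near 1.8e-5 at step 31,680. The EMA values printed here depend entirely on the assumed inv_gamma and power and should not be read as the decay actually used in training.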

Training data

Norwegian target corpus — ~680 h, two open Norwegian Bokmål corpora from the National Library of Norway:

| Source       | Clips    | Hours | License    |
|--------------|----------|-------|------------|
| NbAiLab/NST  | ~219,000 | ~540  | Apache 2.0 |
| NbAiLab/NPSC | ~32,000  | ~140  | CC-0       |

Both Norwegian sources ship as per-clip audio with reference transcripts in the upstream HuggingFace datasets. Per-clip preprocessing is therefore minimal: resample to the CosyVoice training sample rate and apply length filtering. No demucs, diarization, or Whisper pseudo-labels are needed, because the source corpora are already clean recordings with verified transcripts.
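The length-filtering step can be sketched as follows. This is illustrative only: the Clip record, the keep_clip helper, and the 1 to 30 second bounds are hypothetical, since the card states neither the thresholds nor the target sample rate. Resampling is left out because it depends on the audio library in use.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    """Hypothetical per-clip metadata record."""
    path: str
    num_samples: int
    sample_rate: int
    text: str

    @property
    def duration_s(self) -> float:
        return self.num_samples / self.sample_rate


def keep_clip(clip: Clip, min_s: float = 1.0, max_s: float = 30.0) -> bool:
    """Keep clips with a non-empty transcript and a duration inside [min_s, max_s].

    The default bounds are placeholders, not the values used for this release.
    """
    return bool(clip.text.strip()) and min_s <= clip.duration_s <= max_s


clips = [
    Clip("a.wav", num_samples=8_000, sample_rate=16_000, text="Hei."),          # 0.5 s, too short
    Clip("b.wav", num_samples=80_000, sample_rate=16_000, text="God morgen."),  # 5 s, kept
    Clip("c.wav", num_samples=80_000, sample_rate=16_000, text="   "),          # empty transcript
]
kept = [c for c in clips if keep_clip(c)]
print([c.path for c in kept])
```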

English replay slice (anti-catastrophic-forgetting) — a small amount of real English read-speech is interleaved with the Norwegian data so the base model's English-instruct → English-speech pathway isn't eroded by Norwegian-only fine-tuning. Without it, the LoRA drifts toward Norwegian-accented / code-switched English and mispronounces English words embedded in Norwegian sentences:

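| Source                         | Clips                               | License   |
|--------------------------------|-------------------------------------|-----------|
| parler-tts/libritts_r_filtered | 20,000 (10% of the Norwegian count) | CC-BY-4.0 |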

This mirrors CosyVoice 3's own anti-forgetting recipe (arXiv 2505.17589 §2.6): real multilingual replay data, per-utterance language tagging in the instruct prompt, and random prompt masking. The English clips are shuffled uniformly into the Norwegian data (not appended as a block), so the replay is evenly distributed across training rather than arriving in bursts.
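The difference between uniform interleaving and appending the replay data as a block can be sketched as below. The interleave and longest_run helpers, the fixed seed, and the toy utterance IDs are all hypothetical; the actual data loader is not shown in this card.

```python
import random


def interleave(target, replay, seed=0):
    """Tag each utterance with its language, then shuffle both sources together."""
    mixed = [(utt, "Norwegian") for utt in target]
    mixed += [(utt, "English") for utt in replay]
    random.Random(seed).shuffle(mixed)  # one global shuffle spreads replay uniformly
    return mixed


def longest_run(mixed, lang):
    """Length of the longest consecutive run of `lang` in the mixed stream."""
    best = current = 0
    for _, utt_lang in mixed:
        current = current + 1 if utt_lang == lang else 0
        best = max(best, current)
    return best


# Toy stand-ins for the real corpora, keeping a 10:1 ratio.
norwegian = [f"no_{i:04d}" for i in range(1000)]
english = [f"en_{i:04d}" for i in range(100)]

mixed = interleave(norwegian, english)
print("total utterances:", len(mixed))
print("longest English run:", longest_run(mixed, "English"))
```

The language tag carried with each utterance is what would select the per-utterance directive (for example "Speak in Norwegian." or its English counterpart) when the instruct prompt is built.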

Install

CosyVoice isn't a PyPI package — clone the upstream repo and put its vendored third_party/Matcha-TTS on sys.path before importing.

git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice
pip install -r requirements.txt

Pinned CosyVoice revision used to train this release: ace7c47f41bbd303aa6bf1ea80e6f9fbd595cd40. To match it, check out that commit inside the clone (git checkout followed by the hash) before installing the requirements. CosyVoice imports modules from the Matcha-TTS submodule, which the git clone --recursive step above pulls in as third_party/Matcha-TTS. The Python snippet below appends that path to sys.path before the import.

Quick start

import sys
import torch
import soundfile as sf

# Make the vendored Matcha-TTS importable from the CosyVoice repo root.
sys.path.append('third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import CosyVoice3

# Loads the base model (will snapshot_download from ModelScope on first run).
cosy = CosyVoice3('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', fp16=True)

# Overlay the LoRA-merged EMA weights onto the base LLM.
# weights_only=False runs the full unpickler, so only load checkpoints you trust.
state = torch.load('model_31680_ema.pt', map_location='cpu', weights_only=False)
state = {k: v for k, v in state.items() if k not in ('step', 'epoch')}
cosy.model.llm.load_state_dict(state, strict=False)

# Reference audio: pass a file path. CosyVoice loads + resamples internally.
ref_wav = 'path/to/reference.wav'
ref_transcript = 'transcript of the reference audio'

# CV3's LLM asserts the <|endofprompt|> token is present in prompt_text.
# The language directive matches the training-time instruct distribution
# and steers the model toward Norwegian (vs the interleaved English replay).
prompt_text = 'You are a helpful assistant. Speak in Norwegian.<|endofprompt|>' + ref_transcript

chunks = []
for out in cosy.inference_zero_shot(
    'Norsk talesyntese skal være tilgjengelig for alle.',
    prompt_text,
    ref_wav,                  # 3rd positional arg is prompt_wav (a file path)
    stream=False,
):
    chunks.append(out['tts_speech'])

audio = torch.cat(chunks, dim=1).squeeze(0).cpu().numpy()
sf.write('out.wav', audio, cosy.sample_rate)
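Because strict=False makes load_state_dict skip non-matching keys without raising, a mismatched checkpoint can appear to load while changing nothing. PyTorch returns an object with missing_keys and unexpected_keys from that call, and the sketch below inspects it. The helper name check_load_result and the policy of rejecting any unexpected key are assumptions, not part of CosyVoice; the namedtuple is a stand-in for the real return value so the example runs without the model.

```python
from collections import namedtuple


def check_load_result(result, checkpoint_keys):
    """Raise if the checkpoint held keys the module did not recognise.

    Returns (number of checkpoint keys applied, number of module keys left untouched).
    """
    unexpected = list(result.unexpected_keys)
    if unexpected:
        raise RuntimeError(
            f"{len(unexpected)} unexpected keys in checkpoint, e.g. {unexpected[:3]}"
        )
    return len(checkpoint_keys), len(result.missing_keys)


# Stand-in with the same two fields as the value returned by load_state_dict.
LoadResult = namedtuple("LoadResult", ["missing_keys", "unexpected_keys"])

ok = LoadResult(missing_keys=[], unexpected_keys=[])
print(check_load_result(ok, ["a.weight", "b.weight"]))
```

In the quick start this would be called as check_load_result(cosy.model.llm.load_state_dict(state, strict=False), state.keys()).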

License

This release: CC BY-NC 4.0. Research and non-commercial use only.

Three pieces of licensing apply:

  • The base model FunAudioLLM/Fun-CosyVoice3-0.5B-2512 is Apache 2.0 — that license still applies to the unmodified base weights inside the EMA-merged checkpoint. Apache 2.0 permits redistribution under additional terms for derivative work.
  • Our LoRA delta and the model card are released under CC BY-NC 4.0.
  • The training corpus (NST Apache 2.0, NPSC CC-0) would on its own permit a more permissive release; the CC BY-NC restriction is a deliberate choice by the publisher to keep the LoRA delta and the trained Norwegian behaviour off the commercial market.

If you have a specific commercial use case in mind, contact the publisher — a commercial license can be discussed case-by-case, but the default position is non-commercial.

Caveats

  • Bokmål-focused; Nynorsk and dialectal coverage is limited by the source corpus.
  • The text input must be prefixed with an instruct string terminated by <|endofprompt|> at inference time — this matches the training-time tokenization and is required for the CV3 LLM to emit speech tokens at all. Use a Norwegian language directive (e.g. "You are a helpful assistant. Speak in Norwegian.<|endofprompt|>") to steer output toward Norwegian; a bare "You are a helpful assistant.<|endofprompt|>" also satisfies the assertion but, since the model also saw an English replay slice, is less reliably Norwegian.
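The prompt format described above can be wrapped in a small helper so the terminator is never forgotten. This is a minimal sketch: build_prompt_text is a hypothetical name, and the check for exactly one terminator is a stricter assumption than the presence check that the card attributes to the CV3 LLM.

```python
END_OF_PROMPT = "<|endofprompt|>"


def build_prompt_text(ref_transcript,
                      directive="Speak in Norwegian.",
                      system="You are a helpful assistant."):
    """Return '<system> <directive><|endofprompt|><ref_transcript>'.

    Pass directive="" to get the bare system prompt without a language directive.
    """
    instruct = " ".join(part for part in (system, directive) if part)
    prompt = f"{instruct}{END_OF_PROMPT}{ref_transcript}"
    # Guard against a transcript that already contains the terminator.
    if prompt.count(END_OF_PROMPT) != 1:
        raise ValueError("prompt_text must contain exactly one <|endofprompt|> token")
    return prompt


print(build_prompt_text("Hei, dette er en test."))
```

The returned string is what the quick start passes as prompt_text to inference_zero_shot.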

Auto-generated by training-cosy/scripts/publish_hf.py on 2026-06-04.
