MOSS-TTS Family


    

MOSS-TTS-Local-Transformer-v1.5

MOSS-TTS-Local-Transformer-v1.5 continues training from MOSS-TTS-Local-Transformer-v1.0. It preserves the main 1.0 capabilities: zero-shot voice cloning, long-form speech generation, token-level duration control, Pinyin/IPA pronunciation control, multilingual synthesis, and code-switching. For the full 1.0 feature walkthrough, input schema, and evaluation tables, see the MOSS-TTS-Local-Transformer-v1.0 README.

Compared with MOSS-TTS-Local-Transformer-v1.0, v1.5 focuses on the following improvements:

  • Higher-fidelity stereo audio modeling: v1.5 uses MOSS-Audio-Tokenizer-v2 as the audio tokenizer, supporting native 48 kHz stereo input and output for richer spatial detail and more natural perceived audio quality. Since the codec output is stereo, save the [channels, samples] tensor returned by processor.decode(...) directly.
  • Stronger multilingual synthesis with language tags: when the language field is omitted, v1.5 may improve some languages and regress slightly on others compared with 1.0. When the language is specified, v1.5 is stronger than 1.0 on almost all supported languages. Set the tag when building the user message, for example processor.build_user_message(text=text_fr, language="French").
  • More stable voice cloning: v1.5 improves speaker similarity and reduces cloning variance, making repeated generations more consistent.
  • Better long-reference, short-text cloning: v1.5 handles scenarios where the reference audio is much longer than the target text more reliably than 1.0.
  • More stable punctuation-following prosody: v1.5 follows punctuation-driven pauses more closely, especially in long sentences.
  • Explicit pause control: v1.5 supports inline pause markers such as "[pause 3.2s]". For example, 我今天学习了一首中国的古诗,它的名字是[pause 3.2s]静夜思! inserts an explicit 3.2s pause before 静夜思.
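
The pause markers described in the last bullet can also be produced and checked programmatically. The following is a minimal sketch, assuming the marker grammar is exactly `[pause X.Ys]` as in the example; `pause_marker` and `find_pauses` are hypothetical helper names and are not part of the MOSS-TTS API.

```python
import re

# Assumed marker grammar, inferred from the "[pause 3.2s]" example above.
PAUSE_PATTERN = re.compile(r"\[pause (\d+(?:\.\d+)?)s\]")


def pause_marker(seconds: float) -> str:
    """Format an inline pause marker with one decimal place."""
    if seconds <= 0:
        raise ValueError("pause length must be positive")
    return f"[pause {seconds:.1f}s]"


def find_pauses(text: str) -> list[float]:
    """Return the pause lengths (in seconds) found in a text, in order."""
    return [float(value) for value in PAUSE_PATTERN.findall(text)]


text = "它的名字是" + pause_marker(3.2) + "静夜思!"
print(text)
print(find_pauses(text))
```

A check like `find_pauses` can be handy before sending text to the model, for example to confirm that a template did not drop or mangle a marker.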

Supported Languages

MOSS-TTS-Local-Transformer-v1.5 supports 31 languages. It keeps the 20 languages supported by MOSS-TTS-Local-Transformer-v1.0 and extends multilingual continued training to 11 additional languages: Cantonese, Dutch, Finnish, Hindi, Macedonian, Malay, Romanian, Swahili, Tagalog, Thai, and Vietnamese.

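| Language | Code | Flag | Language | Code | Flag | Language | Code | Flag |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Chinese | zh | 🇨🇳 | Cantonese | yue | 🇭🇰 | English | en | 🇺🇸 |
| Arabic | ar | 🇸🇦 | Czech | cs | 🇨🇿 | Danish | da | 🇩🇰 |
| Dutch | nl | 🇳🇱 | Finnish | fi | 🇫🇮 | French | fr | 🇫🇷 |
| German | de | 🇩🇪 | Greek | el | 🇬🇷 | Hebrew | he | 🇮🇱 |
| Hindi | hi | 🇮🇳 | Hungarian | hu | 🇭🇺 | Italian | it | 🇮🇹 |
| Japanese | ja | 🇯🇵 | Korean | ko | 🇰🇷 | Macedonian | mk | 🇲🇰 |
| Malay | ms | 🇲🇾 | Persian (Farsi) | fa | 🇮🇷 | Polish | pl | 🇵🇱 |
| Portuguese | pt | 🇵🇹 | Romanian | ro | 🇷🇴 | Russian | ru | 🇷🇺 |
| Spanish | es | 🇪🇸 | Swahili | sw | 🇹🇿 | Swedish | sv | 🇸🇪 |
| Tagalog | tl | 🇵🇭 | Thai | th | 🇹🇭 | Turkish | tr | 🇹🇷 |
| Vietnamese | vi | 🇻🇳 | | | | | | |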
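
If your pipeline tracks languages by code, a small lookup can turn the codes in the table above into the English names passed as the `language` tag. This is only a sketch: the examples in this README confirm "Chinese", "English", and "French" as tags, and it is an assumption that every other name in the table is accepted in the same form. `LANGUAGES` and `language_tag` are hypothetical names.

```python
# Code-to-name mapping copied from the table above.
# Assumption: the English name is the tag the processor expects for every entry.
LANGUAGES = {
    "zh": "Chinese", "yue": "Cantonese", "en": "English",
    "ar": "Arabic", "cs": "Czech", "da": "Danish",
    "nl": "Dutch", "fi": "Finnish", "fr": "French",
    "de": "German", "el": "Greek", "he": "Hebrew",
    "hi": "Hindi", "hu": "Hungarian", "it": "Italian",
    "ja": "Japanese", "ko": "Korean", "mk": "Macedonian",
    "ms": "Malay", "fa": "Persian",  # listed as "Persian (Farsi)" in the table
    "pl": "Polish", "pt": "Portuguese", "ro": "Romanian",
    "ru": "Russian", "es": "Spanish", "sw": "Swahili",
    "sv": "Swedish", "tl": "Tagalog", "th": "Thai",
    "tr": "Turkish", "vi": "Vietnamese",
}


def language_tag(code: str) -> str:
    """Map a language code from the table to its English name."""
    try:
        return LANGUAGES[code.lower()]
    except KeyError:
        raise ValueError(f"unsupported language code: {code!r}") from None


print(language_tag("fr"))
```

The result could then be passed along the lines of `processor.build_user_message(text=text_fr, language=language_tag("fr"))`.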

Quick Start

Environment Setup

We recommend a clean, isolated Python environment with Transformers 5.0.0, or a recent Transformers version with Qwen3 support, to avoid dependency conflicts.

conda create -n moss-tts python=3.12 -y
conda activate moss-tts

Install all required dependencies:

git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime]"

(Optional) Install FlashAttention 2

For better speed and lower GPU memory usage, you can install FlashAttention 2 if your hardware supports it.

pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[flash-attn]" --no-build-isolation

If your machine has limited RAM and many CPU cores, you can cap build parallelism:

MAX_JOBS=4 pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[flash-attn]" --no-build-isolation

Notes:

  • Dependencies are managed in pyproject.toml, which currently pins torch==2.9.1+cu128 and torchaudio==2.9.1+cu128.
  • If FlashAttention 2 fails to build on your machine, you can skip it and use the default attention backend.
  • FlashAttention 2 is only available on supported GPUs and is typically used with torch.float16 or torch.bfloat16.
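
After installation, it can be useful to confirm which packages are visible to the interpreter before loading the model. The sketch below uses only the standard library and checks for the top-level import names used later in this README (`torch`, `torchaudio`, `transformers`, `flash_attn`); `installed` and `environment_report` are hypothetical helper names.

```python
import importlib.util
import sys


def installed(package: str) -> bool:
    """Report whether a top-level package is importable, without importing it."""
    return importlib.util.find_spec(package) is not None


def environment_report(
    packages=("torch", "torchaudio", "transformers", "flash_attn"),
) -> dict[str, bool]:
    """Map each package name to whether it was found."""
    return {name: installed(name) for name in packages}


if __name__ == "__main__":
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
    for name, found in environment_report().items():
        print(f"{name:<14}{'found' if found else 'missing'}")
```

A missing `flash_attn` is not an error: as noted above, the default attention backend can be used instead.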

Basic Usage

Tip: MOSS-TTS-Local-Transformer-v1.5 uses a fixed 12-codebook RVQ depth. Do not set n_vq_for_inference to a value different from config.n_vq.

MOSS-TTS-Local-Transformer-v1.5 provides the standard Hugging Face AutoProcessor and AutoModel interface. The examples below cover:

  1. Direct generation with language tags
  2. Voice cloning
  3. Duration control
  4. Explicit pause control with [pause X.Ys]

from pathlib import Path
from tqdm import tqdm
import importlib.util

import torch
import torchaudio
from transformers import AutoModel, AutoProcessor

# Disable the cuDNN SDPA backend, which is broken on some CUDA/PyTorch combinations.
torch.backends.cuda.enable_cudnn_sdp(False)
# Keep these enabled as fallbacks.
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(True)

pretrained_model_name_or_path = "OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32


def resolve_attn_implementation() -> str:
    # Prefer FlashAttention 2 when package + device conditions are met.
    if (
        device == "cuda"
        and importlib.util.find_spec("flash_attn") is not None
        and dtype in {torch.float16, torch.bfloat16}
    ):
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:
            return "flash_attention_2"
    # CUDA fallback: use PyTorch SDPA kernels.
    if device == "cuda":
        return "sdpa"
    # CPU fallback.
    return "eager"


attn_implementation = resolve_attn_implementation()
print(f"[INFO] Using attn_implementation={attn_implementation}")

processor = AutoProcessor.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
)
processor.audio_tokenizer = processor.audio_tokenizer.to(device)

text_zh = "亲爱的你,愿你的每一天都值得被记住,也值得被珍惜。"
text_en = "We stand on the threshold of the AI era, where intelligence becomes an extension of human creativity."
text_fr = "Bonjour, je voudrais essayer une voix française naturelle et stable."
text_pause = "我今天学习了一首中国的古诗,它的名字是[pause 3.2s]静夜思!"

# Use remote demo audio to avoid requiring local assets.
ref_audio_zh = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_zh.wav"
ref_audio_en = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_en.m4a"

conversations = [
    # Direct TTS. Language tags are recommended in v1.5 when the language is known.
    [processor.build_user_message(text=text_zh, language="Chinese")],
    [processor.build_user_message(text=text_en, language="English")],
    [processor.build_user_message(text=text_fr, language="French")],
    # Explicit pause control. Use [pause X.Ys], such as [pause 3.2s].
    [processor.build_user_message(text=text_pause, language="Chinese")],
    # Voice cloning with a reference audio.
    [processor.build_user_message(text=text_zh, reference=[ref_audio_zh], language="Chinese")],
    [processor.build_user_message(text=text_en, reference=[ref_audio_en], language="English")],
    # Duration control. At 12.5 frames per second, 125 frames is about 10 seconds.
    [processor.build_user_message(text=text_en, tokens=125, language="English")],
]

model = AutoModel.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
    attn_implementation=attn_implementation,
    torch_dtype=dtype,
).to(device)
model.eval()

batch_size = 1
save_dir = Path("inference_root_moss_tts_local_v1_5")
save_dir.mkdir(exist_ok=True, parents=True)
sample_idx = 0

with torch.no_grad():
    for start in tqdm(range(0, len(conversations), batch_size)):
        batch_conversations = conversations[start : start + batch_size]
        batch = processor(batch_conversations, mode="generation")
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        outputs = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=4096,
            do_sample=True,
            audio_temperature=1.7,
            audio_top_p=0.8,
            audio_top_k=25,
            audio_repetition_penalty=1.0,
        )

        for message in processor.decode(outputs):
            if message is None:
                continue
            audio = message.audio_codes_list[0]
            out_path = save_dir / f"sample{sample_idx}.wav"
            sample_idx += 1
            # MOSS-TTS Local v1.5 codec returns stereo audio as [channels, samples].
            # Save the two-channel tensor directly.
            torchaudio.save(str(out_path), audio, processor.model_config.sampling_rate)
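
The `tokens` argument in the duration-control example is expressed in frames. A small conversion helper, sketched below, makes the arithmetic explicit. It assumes the 12.5 frames-per-second rate stated in the example's comment; `seconds_to_tokens` and `tokens_to_seconds` are hypothetical names.

```python
FRAMES_PER_SECOND = 12.5  # frame rate stated in the duration-control comment above


def seconds_to_tokens(seconds: float) -> int:
    """Convert a target duration in seconds to a frame (token) count."""
    if seconds <= 0:
        raise ValueError("duration must be positive")
    return max(1, round(seconds * FRAMES_PER_SECOND))


def tokens_to_seconds(tokens: int) -> float:
    """Convert a frame (token) count back to seconds."""
    return tokens / FRAMES_PER_SECOND


print(seconds_to_tokens(10))   # 125, matching the example above
print(tokens_to_seconds(125))  # 10.0
```

The result could be passed as `processor.build_user_message(text=text_en, tokens=seconds_to_tokens(10), language="English")`.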

Generation Parameters

| Parameter | Recommended | Description |
| --- | --- | --- |
| audio_temperature | 1.7 | Sampling temperature for audio RVQ layers. |
| audio_top_p | 0.8 | Nucleus sampling cutoff for audio RVQ layers. |
| audio_top_k | 25 | Top-k sampling cutoff for audio RVQ layers. |
| audio_repetition_penalty | 1.0 | Penalty for repeated acoustic token patterns. |
| n_vq_for_inference | 12 | Fixed by this release. Values other than config.n_vq are rejected. |
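
One way to keep these settings in a single place is a small helper that merges overrides into the recommended values and fails early on a mismatched codebook depth. This is a sketch, not part of the model's API: `generation_kwargs` is a hypothetical helper, and the early check only mirrors the documented rule that values other than `config.n_vq` are rejected.

```python
# Recommended values taken from the table and the Basic Usage example.
RECOMMENDED_GENERATION_KWARGS = {
    "max_new_tokens": 4096,
    "do_sample": True,
    "audio_temperature": 1.7,
    "audio_top_p": 0.8,
    "audio_top_k": 25,
    "audio_repetition_penalty": 1.0,
}


def generation_kwargs(config_n_vq: int = 12, **overrides) -> dict:
    """Merge overrides into the recommended settings and validate n_vq."""
    kwargs = {**RECOMMENDED_GENERATION_KWARGS, **overrides}
    n_vq = kwargs.get("n_vq_for_inference", config_n_vq)
    if n_vq != config_n_vq:
        raise ValueError(
            f"n_vq_for_inference={n_vq} does not match config.n_vq={config_n_vq}"
        )
    return kwargs


print(generation_kwargs(audio_temperature=1.5))
```

The returned dictionary could then be unpacked into the call, as in `model.generate(input_ids=input_ids, attention_mask=attention_mask, **generation_kwargs())`.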

Notes

  • This repository uses Hugging Face remote code. Load it with trust_remote_code=True.
  • The MOSS-TTS-Local-Transformer-v1.5 codec is stereo. processor.decode(...) returns audio tensors shaped as [channels, samples], so save them directly with torchaudio.save(path, audio, sampling_rate).
  • Audio encoding and decoding use OpenMOSS-Team/MOSS-Audio-Tokenizer-v2.
  • The model configuration sets sampling_rate to 48000 and n_vq to 12.
  • If FlashAttention 2 is unavailable, the example falls back to SDPA on CUDA and eager attention on CPU.

SGLang Usage

You can serve MOSS-TTS-Local-Transformer-v1.5 with SGLang-Omni, which exposes an OpenAI-compatible /v1/audio/speech API for reference-less synthesis, zero-shot voice cloning, streaming, duration control, and language/style hints.

See the MOSS-TTS-Local cookbook for installation, full API details, deployment config, benchmarking, and limitations.

Install and Serve

Install sglang-omni by following the SGLang-Omni installation guide, then download and serve the model:

hf download OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5

sgl-omni serve \
  --model-path OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5 \
  --port 8000

A matching config file is available in SGLang-Omni at examples/configs/moss_tts_local.yaml.

Basic Speech

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "SGLang-Omni is a great project!"}' \
  --output output.wav
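
The same request can be sent from Python. The sketch below uses only the standard library and assumes the server started above is listening on localhost:8000; `build_speech_request` and `synthesize` are hypothetical helper names, and any extra keyword fields are forwarded into the JSON body unchanged.

```python
import json
import urllib.request


def build_speech_request(text, base_url="http://localhost:8000", **fields):
    """Build a POST request for the /v1/audio/speech endpoint."""
    payload = {"input": text, **fields}
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="output.wav", **fields):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, **fields)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as f:
        f.write(response.read())


# Requires a running server:
# synthesize("SGLang-Omni is a great project!")
```

Keeping request construction separate from the network call makes the payload easy to inspect or log before anything is sent.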

Voice Cloning

Provide a reference clip and, optionally, its transcript for better speaker similarity. audio_path may be a local path readable by the server, an HTTP(S) URL, or a base64 data URI.

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "SGLang-Omni is a great project!",
    "references": [{
      "audio_path": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav"
    }]
  }' \
  --output output.wav

ref_audio and ref_text are accepted as shorthand for references[0].audio_path and references[0].text.
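
When the reference clip lives on the client rather than the server, the base64 data URI form avoids the need for a shared path. The following sketch builds such a URI and the matching request body. The `audio/wav` MIME type is an assumption (the accepted types are not listed here), and `audio_data_uri` and `cloning_payload` are hypothetical helper names.

```python
import base64
from pathlib import Path


def audio_data_uri(path, mime="audio/wav"):
    """Encode a local audio file as a base64 data URI.

    The default MIME type is an assumption; adjust it to your file format.
    """
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def cloning_payload(text, audio_path, ref_text=None):
    """Build a request body with one reference clip and an optional transcript."""
    reference = {"audio_path": audio_path}
    if ref_text is not None:
        reference["text"] = ref_text
    return {"input": text, "references": [reference]}


# Example (requires a local file named reference.wav):
# body = cloning_payload("Hello!", audio_data_uri("reference.wav"), ref_text="...")
```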

Streaming

Set "stream": true, "response_format": "pcm", and "stream_format": "audio" to receive raw 48 kHz PCM chunks. Pipe the stream through ffmpeg to write a playable WAV file:

curl -N -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Get the trust fund to the bank early.",
    "ref_audio": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
    "stream": true,
    "response_format": "pcm",
    "stream_format": "audio"
  }' \
  | ffmpeg -f s16le -ar 48000 -ac 1 -i pipe:0 output_stream.wav
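
If ffmpeg is not available, the raw PCM bytes can be wrapped in a WAV container with Python's standard `wave` module. This sketch mirrors the ffmpeg flags above (signed 16-bit little-endian, 48000 Hz, one channel); those defaults are assumptions carried over from that command, so set `channels` to match what your server actually emits. `pcm_to_wav` is a hypothetical helper name.

```python
import wave


def pcm_to_wav(pcm: bytes, out_path: str, sample_rate: int = 48000, channels: int = 1) -> None:
    """Wrap raw s16le PCM bytes in a WAV container."""
    sample_width = 2  # bytes per sample for signed 16-bit audio
    frame_size = sample_width * channels
    usable = len(pcm) - (len(pcm) % frame_size)  # drop a trailing partial frame
    with wave.open(out_path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm[:usable])


# Example: collect streamed chunks into one buffer, then write the file once.
# pcm_to_wav(b"".join(chunks), "output_stream.wav")
```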

Duration, Markup, and Language

Duration can be guided with an inline ${token:N} prefix or with token_count / duration_tokens. Inline markup such as [pause 0.5s], Pinyin, and IPA is passed through unchanged. Use language to hint the target language and instructions for free-form style guidance.

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "${token:150}今天天气不错 [pause 0.5s] 就该出去晒晒太阳。",
    "ref_audio": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
    "language": "Chinese"
  }' \
  --output output_markup.wav
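
The request body for this call can also be assembled in Python, which sidesteps shell quoting of the `${token:N}` prefix. A minimal sketch, using only the field names mentioned above (`input`, `language`, `instructions`); `with_token_prefix` and `markup_payload` are hypothetical helper names.

```python
import json


def with_token_prefix(text: str, tokens: int) -> str:
    """Prepend the inline ${token:N} duration hint to the input text."""
    if tokens <= 0:
        raise ValueError("token count must be positive")
    return f"${{token:{tokens}}}{text}"


def markup_payload(text, tokens=None, language=None, instructions=None):
    """Build a request body with optional duration, language, and style hints."""
    payload = {"input": with_token_prefix(text, tokens) if tokens else text}
    if language is not None:
        payload["language"] = language
    if instructions is not None:
        payload["instructions"] = instructions
    return payload


body = markup_payload(
    "今天天气不错 [pause 0.5s] 就该出去晒晒太阳。",
    tokens=150,
    language="Chinese",
)
print(json.dumps(body, ensure_ascii=False))
```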

More Usage

MOSS-TTS-Local-Transformer-v1.5 is API-compatible with MOSS-TTS-Local-Transformer-v1.0. For continuation with prefix audio, detailed UserMessage and AssistantMessage fields, generation hyperparameters, Pinyin/IPA preprocessing examples, and evaluation results, see the MOSS-TTS-Local-Transformer-v1.0 README.

Citation

If you use this model, please cite the MOSS-TTS Technical Report.
