Instructions to use OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5 with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-to-speech", model="OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5", trust_remote_code=True)# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5", trust_remote_code=True, dtype="auto") - Notebooks
- Google Colab
- Kaggle
MOSS-TTS Family
MOSS-TTS-Local-Transformer-v1.5
MOSS-TTS-Local-Transformer-v1.5 is continued from MOSS-TTS-Local-Transformer-v1.0. It preserves the main 1.0 capabilities, including zero-shot voice cloning, long-form speech generation, token-level duration control, Pinyin/IPA pronunciation control, multilingual synthesis, and code-switching. For the full 1.0 feature walkthrough, input schema, and evaluation tables, please refer to the MOSS-TTS-Local-Transformer-v1.0 README.
Compared with MOSS-TTS-Local-Transformer-v1.0, v1.5 focuses on the following improvements:
- Higher-fidelity stereo audio modeling: v1.5 uses MOSS-Audio-Tokenizer-v2 as the audio tokenizer, supporting native 48 kHz stereo input and output for richer spatial detail and more natural perceived audio quality. Since the codec output is stereo, save the
[channels, samples]tensor returned byprocessor.decode(...)directly. - Stronger multilingual synthesis with language tags: when the
languagefield is omitted, v1.5 may improve some languages and regress slightly on others compared with 1.0. When the language is specified, v1.5 is stronger than 1.0 on almost all supported languages. Set the tag when building the user message, for exampleprocessor.build_user_message(text=text_fr, language="French"). - More stable voice cloning: v1.5 improves speaker similarity and reduces cloning variance, making repeated generations more consistent.
- Better long-reference, short-text cloning: v1.5 handles scenarios where the reference audio is much longer than the target text more reliably than 1.0.
- More stable punctuation-following prosody: v1.5 follows punctuation-driven pauses more closely, especially in long sentences.
- Explicit pause control: v1.5 supports inline pause markers such as
"[pause 3.2s]". For example,我今天学习了一首中国的古诗,它的名字是[pause 3.2s]静夜思!inserts an explicit 3.2s pause before静夜思.
Supported Languages
MOSS-TTS Local Transformer v1.5 supports 31 languages. It keeps the 20 languages supported by MOSS-TTS-Local-Transformer-v1.0 and extends multilingual continued training to additional languages including Cantonese, Dutch, Finnish, Hindi, Macedonian, Malay, Romanian, Swahili, Tagalog, Thai, and Vietnamese.
| Language | Code | Flag | Language | Code | Flag | Language | Code | Flag |
|---|---|---|---|---|---|---|---|---|
| Chinese | zh | 🇨🇳 | Cantonese | yue | 🇭🇰 | English | en | 🇺🇸 |
| Arabic | ar | 🇸🇦 | Czech | cs | 🇨🇿 | Danish | da | 🇩🇰 |
| Dutch | nl | 🇳🇱 | Finnish | fi | 🇫🇮 | French | fr | 🇫🇷 |
| German | de | 🇩🇪 | Greek | el | 🇬🇷 | Hebrew | he | 🇮🇱 |
| Hindi | hi | 🇮🇳 | Hungarian | hu | 🇭🇺 | Italian | it | 🇮🇹 |
| Japanese | ja | 🇯🇵 | Korean | ko | 🇰🇷 | Macedonian | mk | 🇲🇰 |
| Malay | ms | 🇲🇾 | Persian (Farsi) | fa | 🇮🇷 | Polish | pl | 🇵🇱 |
| Portuguese | pt | 🇵🇹 | Romanian | ro | 🇷🇴 | Russian | ru | 🇷🇺 |
| Spanish | es | 🇪🇸 | Swahili | sw | 🇹🇿 | Swedish | sv | 🇸🇪 |
| Tagalog | tl | 🇵🇭 | Thai | th | 🇹🇭 | Turkish | tr | 🇹🇷 |
| Vietnamese | vi | 🇻🇳 |
Quick Start
Environment Setup
We recommend a clean, isolated Python environment with Transformers 5.0.0, or a recent Transformers version with Qwen3 support, to avoid dependency conflicts.
conda create -n moss-tts python=3.12 -y
conda activate moss-tts
Install all required dependencies:
git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime]"
(Optional) Install FlashAttention 2
For better speed and lower GPU memory usage, you can install FlashAttention 2 if your hardware supports it.
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[flash-attn]" --no-build-isolation
If your machine has limited RAM and many CPU cores, you can cap build parallelism:
MAX_JOBS=4 pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[flash-attn]" --no-build-isolation
Notes:
- Dependencies are managed in
pyproject.toml, which currently pinstorch==2.9.1+cu128andtorchaudio==2.9.1+cu128. - If FlashAttention 2 fails to build on your machine, you can skip it and use the default attention backend.
- FlashAttention 2 is only available on supported GPUs and is typically used with
torch.float16ortorch.bfloat16.
Basic Usage
Tip: MOSS-TTS-Local-Transformer-v1.5 uses a fixed 12-codebook RVQ depth. Do not set
n_vq_for_inferenceto a value different fromconfig.n_vq.
MOSS-TTS-Local-Transformer-v1.5 provides the standard Hugging Face AutoProcessor and AutoModel interface. The examples below cover:
- Direct generation with language tags
- Voice cloning
- Duration control
- Explicit pause control with
[pause X.Ys]
from pathlib import Path
from tqdm import tqdm
import importlib.util
import torch
import torchaudio
from transformers import AutoModel, AutoProcessor
# Disable the broken cuDNN SDPA backend on some CUDA/PyTorch combinations.
torch.backends.cuda.enable_cudnn_sdp(False)
# Keep these enabled as fallbacks.
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(True)
pretrained_model_name_or_path = "OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32
def resolve_attn_implementation() -> str:
# Prefer FlashAttention 2 when package + device conditions are met.
if (
device == "cuda"
and importlib.util.find_spec("flash_attn") is not None
and dtype in {torch.float16, torch.bfloat16}
):
major, _ = torch.cuda.get_device_capability()
if major >= 8:
return "flash_attention_2"
# CUDA fallback: use PyTorch SDPA kernels.
if device == "cuda":
return "sdpa"
# CPU fallback.
return "eager"
attn_implementation = resolve_attn_implementation()
print(f"[INFO] Using attn_implementation={attn_implementation}")
processor = AutoProcessor.from_pretrained(
pretrained_model_name_or_path,
trust_remote_code=True,
)
processor.audio_tokenizer = processor.audio_tokenizer.to(device)
text_zh = "亲爱的你,愿你的每一天都值得被记住,也值得被珍惜。"
text_en = "We stand on the threshold of the AI era, where intelligence becomes an extension of human creativity."
text_fr = "Bonjour, je voudrais essayer une voix francaise naturelle et stable."
text_pause = "我今天学习了一首中国的古诗,它的名字是[pause 3.2s]静夜思!"
# Use remote demo audio to avoid requiring local assets.
ref_audio_zh = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_zh.wav"
ref_audio_en = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_en.m4a"
conversations = [
# Direct TTS. Language tags are recommended in v1.5 when the language is known.
[processor.build_user_message(text=text_zh, language="Chinese")],
[processor.build_user_message(text=text_en, language="English")],
[processor.build_user_message(text=text_fr, language="French")],
# Explicit pause control. Use [pause X.Ys], such as [pause 3.2s].
[processor.build_user_message(text=text_pause, language="Chinese")],
# Voice cloning with a reference audio.
[processor.build_user_message(text=text_zh, reference=[ref_audio_zh], language="Chinese")],
[processor.build_user_message(text=text_en, reference=[ref_audio_en], language="English")],
# Duration control. At 12.5 frames per second, 125 frames is about 10 seconds.
[processor.build_user_message(text=text_en, tokens=125, language="English")],
]
model = AutoModel.from_pretrained(
pretrained_model_name_or_path,
trust_remote_code=True,
attn_implementation=attn_implementation,
torch_dtype=dtype,
).to(device)
model.eval()
batch_size = 1
save_dir = Path("inference_root_moss_tts_local_v1_5")
save_dir.mkdir(exist_ok=True, parents=True)
sample_idx = 0
with torch.no_grad():
for start in tqdm(range(0, len(conversations), batch_size)):
batch_conversations = conversations[start : start + batch_size]
batch = processor(batch_conversations, mode="generation")
input_ids = batch["input_ids"].to(device)
attention_mask = batch["attention_mask"].to(device)
outputs = model.generate(
input_ids=input_ids,
attention_mask=attention_mask,
max_new_tokens=4096,
do_sample=True,
audio_temperature=1.7,
audio_top_p=0.8,
audio_top_k=25,
audio_repetition_penalty=1.0,
)
for message in processor.decode(outputs):
if message is None:
continue
audio = message.audio_codes_list[0]
out_path = save_dir / f"sample{sample_idx}.wav"
sample_idx += 1
# MOSS-TTS Local v1.5 codec returns stereo audio as [channels, samples].
# Save the two-channel tensor directly.
torchaudio.save(str(out_path), audio, processor.model_config.sampling_rate)
Generation Parameters
| Parameter | Recommended | Description |
|---|---|---|
audio_temperature |
1.7 |
Sampling temperature for audio RVQ layers. |
audio_top_p |
0.8 |
Nucleus sampling cutoff for audio RVQ layers. |
audio_top_k |
25 |
Top-k sampling cutoff for audio RVQ layers. |
audio_repetition_penalty |
1.0 |
Penalty for repeated acoustic token patterns. |
n_vq_for_inference |
12 |
Fixed by this release. Values other than config.n_vq are rejected. |
Notes
- This repository uses Hugging Face remote code. Load it with
trust_remote_code=True. - The MOSS-TTS-Local-Transformer-v1.5 codec is stereo.
processor.decode(...)returns audio tensors shaped as[channels, samples], so save them directly withtorchaudio.save(path, audio, sampling_rate). - Audio encoding and decoding use
OpenMOSS-Team/MOSS-Audio-Tokenizer-v2. - The model configuration sets
sampling_rateto 48000 andn_vqto 12. - If FlashAttention 2 is unavailable, the example falls back to SDPA on CUDA and eager attention on CPU.
SGLang Usage
You can serve MOSS-TTS-Local-Transformer-v1.5 with SGLang-Omni, which exposes an OpenAI-compatible /v1/audio/speech API for reference-less synthesis, zero-shot voice cloning, streaming, duration control, and language/style hints.
See the MOSS-TTS-Local cookbook for installation, full API details, deployment config, benchmarking, and limitations.
Install and Serve
Install sglang-omni by following the SGLang-Omni installation guide, then download and serve the model:
hf download OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5
sgl-omni serve \
--model-path OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5 \
--port 8000
A matching config file is available in SGLang-Omni at examples/configs/moss_tts_local.yaml.
Basic Speech
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"input": "SGLang-Omni is a great project!"}' \
--output output.wav
Voice Cloning
Provide a reference clip and its transcript for better speaker similarity. audio_path may be a local path readable by the server, an HTTP(S) URL, or a base64 data URI.
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "SGLang-Omni is a great project!",
"references": [{
"audio_path": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav"
}]
}' \
--output output.wav
ref_audio and ref_text are accepted as shorthand for references[0].audio_path and references[0].text.
Streaming
Set "stream": true, "response_format": "pcm", and "stream_format": "audio" to receive raw 48 kHz PCM chunks. Pipe the stream through ffmpeg to write a playable WAV file:
curl -N -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "Get the trust fund to the bank early.",
"ref_audio": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
"stream": true,
"response_format": "pcm",
"stream_format": "audio"
}' \
| ffmpeg -f s16le -ar 48000 -ac 1 -i pipe:0 output_stream.wav
Duration, Markup, and Language
Duration can be guided with an inline ${token:N} prefix or with token_count / duration_tokens. Inline markup such as [pause 0.5s], Pinyin, and IPA is passed through unchanged. Use language to hint the target language and instructions for free-form style guidance.
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "${token:150}今天天气不错 [pause 0.5s] 就该出去晒晒太阳。",
"ref_audio": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
"language": "Chinese"
}' \
--output output_markup.wav
More Usage
MOSS-TTS-Local-Transformer-v1.5 is API-compatible with MOSS-TTS-Local-Transformer-v1.0. For continuation with prefix audio, detailed UserMessage and AssistantMessage fields, generation hyperparameters, Pinyin/IPA preprocessing examples, and evaluation results, see the MOSS-TTS-Local-Transformer-v1.0.
Citation
If you use this model, please cite the MOSS-TTS Technical Report.
- Downloads last month
- 114