---
license: apache-2.0
pipeline_tag: text-to-speech
tags:
- audio
- speech
- speech-language-models
datasets:
- amphion/Emilia-Dataset
- neuphonic/emilia-yodas-english-neucodec
---
# NeuTTS Air ☁️
[![NeuTTSAir_Intro](neutts-air.png)](https://www.youtube.com/watch?v=YAB3hCtu5wE)
[🚀 Spaces Demo](https://huggingface.co/spaces/neuphonic/neutts-air), [🔧 Github](https://github.com/neuphonic/neutts-air)
[Q8 GGUF version](https://huggingface.co/neuphonic/neutts-air-q8-gguf), [Q4 GGUF version](https://huggingface.co/neuphonic/neutts-air-q4-gguf)
*Created by [Neuphonic](http://neuphonic.com/) - building faster, smaller, on-device voice AI*
State-of-the-art Voice AI has been locked behind web APIs for too long. NeuTTS Air is the world’s first super-realistic, on-device TTS speech language model with instant voice cloning. Built on a 0.5B LLM backbone, NeuTTS Air brings natural-sounding speech, real-time performance, built-in security and speaker cloning to your local device - unlocking a new category of embedded voice agents, assistants, toys, and compliance-safe apps.
## Key Features
- 🗣 Best-in-class realism for its size - produces natural, ultra-realistic voices that sound human
- 📱 Optimised for on-device deployment - provided in GGML format, ready to run on phones, laptops, or even Raspberry Pis
- 👫 Instant voice cloning - create your own speaker with as little as 3 seconds of audio
- 🚄 Simple LM + codec architecture built on a 0.5B backbone - the sweet spot between speed, size, and quality for real-world applications
> [!CAUTION]
> Websites like neutts.com are popping up, and they are not affiliated with Neuphonic, our GitHub, or this repo.
>
> We are on neuphonic.com only. Please be careful out there! 🙏
## Model Details
NeuTTS Air is built off Qwen 0.5B - a lightweight yet capable language model optimised for text understanding and generation - as well as a powerful combination of technologies designed for efficiency and quality:
- **Audio Codec**: [NeuCodec](https://huggingface.co/neuphonic/neucodec) - our proprietary neural audio codec that achieves exceptional audio quality at low bitrates using a single codebook
- **Format**: Available in GGML format for efficient on-device inference
- **Responsibility**: Watermarked outputs
- **Inference Speed**: Real-time generation on mid-range devices
- **Power Consumption**: Optimised for mobile and embedded devices
## Get Started
1. **Clone the [Git Repo](https://github.com/neuphonic/neutts-air)**
```bash
git clone https://github.com/neuphonic/neutts-air.git
cd neutts-air
```
2. **Install `espeak` (required dependency)**
Please refer to the following link for instructions on how to install `espeak`:
https://github.com/espeak-ng/espeak-ng/blob/master/docs/guide.md
```bash
# macOS
brew install espeak
# Ubuntu/Debian
sudo apt install espeak
# Arch Linux
paru -S aur/espeak
```
3. **Install Python dependencies**
The requirements file includes the dependencies needed to run the model with PyTorch. When using an ONNX decoder or a GGML model, some dependencies (such as PyTorch) are no longer required.
Inference is compatible with and tested on `python>=3.11`.
```bash
pip install -r requirements.txt
```
## **Basic Example**
Run the basic example script to synthesize speech:
```bash
python -m examples.basic_example \
--input_text "My name is Dave, and um, I'm from London" \
--ref_audio samples/dave.wav \
--ref_text samples/dave.txt
```
To specify a particular model repo for the backbone or codec, add the `--backbone` argument. Available backbones are listed in the [NeuTTS Air Hugging Face collection](https://huggingface.co/collections/neuphonic/neutts-air-68cc14b7033b4c56197ef350).
Several examples are available, including a Jupyter notebook in the `examples` folder.
### **Simple One-Code Block Usage**
```python
from neuttsair.neutts import NeuTTSAir
import soundfile as sf
tts = NeuTTSAir(
    backbone_repo="neuphonic/neutts-air-q4-gguf",
    backbone_device="cpu",
    codec_repo="neuphonic/neucodec",
    codec_device="cpu",
)

input_text = "My name is Dave, and um, I'm from London."
ref_audio_path = "samples/dave.wav"
with open("samples/dave.txt") as f:
    ref_text = f.read().strip()
ref_codes = tts.encode_reference(ref_audio_path)
wav = tts.infer(input_text, ref_codes, ref_text)
sf.write("test.wav", wav, 24000)
```
# Tips
NeuTTS Air requires two inputs:
1. A reference audio sample (`.wav` file)
2. A text string
The model then synthesises the text as speech in the style of the reference audio. This is what enables NeuTTS Air’s instant voice cloning capability.
### Example Reference Files
You can find some ready-to-use samples in the `examples` folder:
- `samples/dave.wav`
- `samples/jo.wav`
### Guidelines for Best Results
For optimal performance, reference audio samples should be:
1. **Mono channel**
2. **16-44 kHz sample rate**
3. **3–15 seconds in length**
4. **Saved as a `.wav` file**
5. **Clean** — minimal to no background noise
6. **Natural, continuous speech** — like a monologue or conversation, with few pauses, so the model can capture tone effectively
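The format-related checks above (mono, sample rate, length, `.wav`) can be verified before inference with the standard library alone. The helper below is a minimal sketch of our own, not part of the NeuTTS Air API; the function name and return shape are illustrative:

```python
import wave

def check_reference_wav(path):
    """Return a list of guideline violations for a reference .wav file.

    Checks the measurable guidelines: mono channel, 16-44 kHz sample
    rate, and 3-15 s duration. Cleanliness and naturalness of the
    speech still need a human ear.
    """
    problems = []
    with wave.open(path, "rb") as w:
        channels = w.getnchannels()
        rate = w.getframerate()
        duration = w.getnframes() / rate
    if channels != 1:
        problems.append(f"expected mono, got {channels} channels")
    if not 16_000 <= rate <= 44_000:
        problems.append(f"sample rate {rate} Hz outside 16-44 kHz")
    if not 3.0 <= duration <= 15.0:
        problems.append(f"duration {duration:.1f} s outside 3-15 s")
    return problems
```

An empty list means the file passes the measurable checks, e.g. `check_reference_wav("samples/dave.wav")`.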
# **Responsibility**
Every audio file generated by NeuTTS Air includes the [Perth (Perceptual Threshold) Watermarker](https://github.com/resemble-ai/perth).
# **Disclaimer**
Don't use this model to do bad things… please.