|
|
--- |
|
|
license: apache-2.0 |
|
|
pipeline_tag: text-to-speech |
|
|
tags: |
|
|
- audio |
|
|
- speech |
|
|
- speech-language-models |
|
|
datasets: |
|
|
- amphion/Emilia-Dataset |
|
|
- neuphonic/emilia-yodas-english-neucodec |
|
|
--- |
|
|
|
|
|
# NeuTTS Air ☁️ |
|
|
|
|
|
[▶️ Watch on YouTube](https://www.youtube.com/watch?v=YAB3hCtu5wE)
|
|
|
|
|
[🚀 Spaces Demo](https://huggingface.co/spaces/neuphonic/neutts-air), [🔧 Github](https://github.com/neuphonic/neutts-air) |
|
|
|
|
|
[Q8 GGUF version](https://huggingface.co/neuphonic/neutts-air-q8-gguf), [Q4 GGUF version](https://huggingface.co/neuphonic/neutts-air-q4-gguf) |
|
|
|
|
|
*Created by [Neuphonic](http://neuphonic.com/) - building faster, smaller, on-device voice AI* |
|
|
|
|
|
State-of-the-art voice AI has been locked behind web APIs for too long. NeuTTS Air is the world's first super-realistic, on-device TTS speech language model with instant voice cloning. Built on a 0.5B LLM backbone, NeuTTS Air brings natural-sounding speech, real-time performance, built-in security, and speaker cloning to your local device - unlocking a new category of embedded voice agents, assistants, toys, and compliance-safe apps.
|
|
|
|
|
## Key Features |
|
|
|
|
|
- 🗣️ Best-in-class realism for its size - produces natural, ultra-realistic voices that sound human
- 📱 Optimised for on-device deployment - provided in GGML format, ready to run on phones, laptops, or even Raspberry Pis
- 👫 Instant voice cloning - create your own speaker with as little as 3 seconds of audio
- 🚄 Simple LM + codec architecture built on a 0.5B backbone - the sweet spot between speed, size, and quality for real-world applications
|
|
|
|
|
|
|
|
> [!CAUTION] |
|
|
> Websites like neutts.com are popping up, and they are not affiliated with Neuphonic, our GitHub, or this repo.
|
|
> |
|
|
> We are on neuphonic.com only. Please be careful out there! 🙏 |
|
|
|
|
|
|
|
|
## Model Details |
|
|
|
|
|
NeuTTS Air is built on Qwen 0.5B - a lightweight yet capable language model optimised for text understanding and generation - together with a powerful combination of technologies designed for efficiency and quality:
|
|
|
|
|
- **Audio Codec**: [NeuCodec](https://huggingface.co/neuphonic/neucodec) - our proprietary neural audio codec that achieves exceptional audio quality at low bitrates using a single codebook |
|
|
- **Format**: Available in GGML format for efficient on-device inference |
|
|
- **Responsibility**: Watermarked outputs |
|
|
- **Inference Speed**: Real-time generation on mid-range devices |
|
|
- **Power Consumption**: Optimised for mobile and embedded devices |
|
|
|
|
|
## Get Started |
|
|
|
|
|
1. **Clone the [Git Repo](https://github.com/neuphonic/neutts-air)** |
|
|
|
|
|
```bash |
|
|
git clone https://github.com/neuphonic/neutts-air.git |
|
|
cd neutts-air
|
|
``` |
|
|
|
|
|
2. **Install `espeak` (required dependency)** |
|
|
|
|
|
Please refer to the [espeak-ng installation guide](https://github.com/espeak-ng/espeak-ng/blob/master/docs/guide.md) for instructions on how to install `espeak`:
|
|
|
|
|
```bash |
|
|
# Mac OS |
|
|
brew install espeak |
|
|
|
|
|
# Ubuntu/Debian |
|
|
sudo apt install espeak |
|
|
|
|
|
# Arch Linux |
|
|
paru -S aur/espeak |
|
|
``` |
|
|
|
|
|
3. **Install Python dependencies** |
|
|
|
|
|
The requirements file includes the dependencies needed to run the model with PyTorch. When using an ONNX decoder or a GGML model, some dependencies (such as PyTorch) are no longer required. |
|
|
|
|
|
Inference is compatible with and tested on `python>=3.11`.
|
|
|
|
|
```bash
pip install -r requirements.txt
```
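
Since inference is tested on `python>=3.11`, you may want to install into an isolated environment first. One possible setup (a sketch, assuming `python3.11` is available on your PATH):

```bash
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```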
|
|
|
|
|
|
|
|
## **Basic Example** |
|
|
|
|
|
Run the basic example script to synthesize speech: |
|
|
|
|
|
```bash
python -m examples.basic_example \
  --input_text "My name is Dave, and um, I'm from London" \
  --ref_audio samples/dave.wav \
  --ref_text samples/dave.txt
```
|
|
|
|
|
To specify a particular model repo for the backbone (or codec), pass the corresponding argument, e.g. `--backbone`. Available backbones are listed in the [NeuTTS Air Hugging Face collection](https://huggingface.co/collections/neuphonic/neutts-air-68cc14b7033b4c56197ef350).
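
For example, to run the same script with the Q4 GGUF backbone linked at the top of this card (a sketch - only the `--backbone` flag is added to the command shown above):

```bash
python -m examples.basic_example \
  --input_text "My name is Dave, and um, I'm from London" \
  --ref_audio samples/dave.wav \
  --ref_text samples/dave.txt \
  --backbone neuphonic/neutts-air-q4-gguf
```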
|
|
|
|
|
Several examples are available, including a Jupyter notebook in the `examples` folder. |
|
|
|
|
|
### **Simple One-Code-Block Usage**
|
|
|
|
|
```python
from neuttsair.neutts import NeuTTSAir
import soundfile as sf

# Load the Q4 GGUF backbone and the NeuCodec codec, both on CPU
tts = NeuTTSAir(
    backbone_repo="neuphonic/neutts-air-q4-gguf",
    backbone_device="cpu",
    codec_repo="neuphonic/neucodec",
    codec_device="cpu",
)

input_text = "My name is Dave, and um, I'm from London."

# Reference clip and its transcript define the voice to clone
ref_text_path = "samples/dave.txt"
ref_audio_path = "samples/dave.wav"

with open(ref_text_path, "r") as f:
    ref_text = f.read().strip()
ref_codes = tts.encode_reference(ref_audio_path)

# Synthesise the text in the reference speaker's voice and save at 24 kHz
wav = tts.infer(input_text, ref_codes, ref_text)
sf.write("test.wav", wav, 24000)
```
|
|
|
|
|
## Tips
|
|
|
|
|
NeuTTS Air requires two inputs: |
|
|
|
|
|
1. A reference audio sample (`.wav` file) |
|
|
2. A text string |
|
|
|
|
|
The model then synthesises the text as speech in the style of the reference audio. This is what enables NeuTTS Air’s instant voice cloning capability. |
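
Because the reference is encoded into codes once, you can clone a voice a single time and reuse it for any number of utterances. Below is a minimal sketch that reuses the API from the example above; the constructor arguments mirror that example, and the utterance texts and output file names are illustrative:

```python
from neuttsair.neutts import NeuTTSAir
import soundfile as sf

tts = NeuTTSAir(
    backbone_repo="neuphonic/neutts-air-q4-gguf",
    backbone_device="cpu",
    codec_repo="neuphonic/neucodec",
    codec_device="cpu",
)

# Encode the reference voice once
with open("samples/dave.txt", "r") as f:
    ref_text = f.read().strip()
ref_codes = tts.encode_reference("samples/dave.wav")

# Reuse the cloned voice for several utterances
lines = [
    "Welcome back! Here is your daily summary.",
    "You have three new messages.",
]
for i, line in enumerate(lines):
    wav = tts.infer(line, ref_codes, ref_text)
    sf.write(f"utterance_{i}.wav", wav, 24000)
```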
|
|
|
|
|
### Example Reference Files |
|
|
|
|
|
You can find some ready-to-use samples in the `samples` folder:
|
|
|
|
|
- `samples/dave.wav` |
|
|
- `samples/jo.wav` |
|
|
|
|
|
### Guidelines for Best Results |
|
|
|
|
|
For optimal performance, reference audio samples should be: |
|
|
|
|
|
1. **Mono channel** |
|
|
2. **16-44 kHz sample rate** |
|
|
3. **3–15 seconds in length** |
|
|
4. **Saved as a `.wav` file** |
|
|
5. **Clean** — minimal to no background noise |
|
|
6. **Natural, continuous speech** — like a monologue or conversation, with few pauses, so the model can capture tone effectively |
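
If a recording does not meet these guidelines, you can often fix it up before encoding. The snippet below is a preparation sketch, not part of the NeuTTS Air library; it assumes `librosa` and `soundfile` are installed, and the file names and 16 kHz target are illustrative:

```python
import librosa
import soundfile as sf

RAW_CLIP = "my_voice_raw.wav"     # hypothetical recording of your voice
REF_OUT = "samples/my_voice.wav"  # cleaned reference to pass as --ref_audio

# Load as mono and resample into the supported 16-44 kHz range (16 kHz here)
audio, sr = librosa.load(RAW_CLIP, sr=16000, mono=True)

# Check the recommended 3-15 second length
duration = len(audio) / sr
if not 3.0 <= duration <= 15.0:
    print(f"Warning: clip is {duration:.1f}s; aim for 3-15 seconds")

sf.write(REF_OUT, audio, sr)
```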
|
|
|
|
|
## **Responsibility**
|
|
|
|
|
Every audio file generated by NeuTTS Air includes the [**Perth (Perceptual Threshold) Watermarker**](https://github.com/resemble-ai/perth).
|
|
|
|
|
## **Disclaimer**
|
|
|
|
|
Don't use this model to do bad things… please. |