Upload 2 files
- README.md +204 -0
- requirements.txt +6 -0
README.md
CHANGED
@@ -1,3 +1,207 @@
---
license: apache-2.0
language:
- en
tags:
- text-to-speech
- tts
- voice-cloning
- emotion-control
- speech-synthesis
- packed-model
- pytorch
---

# PackedTTS

PackedTTS is a self-contained text-to-speech runtime bundle that packages the full synthesis stack into a single `tts.pt` file.

The bundle is designed to be loaded directly by the runtime script and used without rebuilding the model stack. It stores the model weights, tokenizer data, packed voices, packed emotions, resolution indexes, and runtime defaults in one artifact.

The example bundle in this repo is intended to be used as-is and currently includes a voice set and an emotion set.

## What is included

`tts.pt` contains:

- T3 weights
- S3Gen weights
- VoiceEncoder weights
- tokenizer JSON
- packed voices
- packed emotions
- lookup indexes
- default resolution settings

This is not a training checkpoint meant to be unpacked and rebuilt from ingredients. It is the runtime artifact.

## Repository contents

- `tts.pt`: packed TTS bundle
- `PackedTTS.py`: runtime loader, resolver, and inference script
- `requirements.txt`: Python dependencies for the runtime
- `README.md`: usage and overview

## Features

- Single-file bundle loading
- Voice selection by name
- Emotion selection by name
- Fuzzy matching for names
- Optional reference-audio overrides
- Packed voices and emotions inside one artifact
- CLI usage for quick testing

## Requirements

You will need:

- Python 3.10+
- A working PyTorch environment
- The dependencies listed in `requirements.txt`

A GPU is recommended, but CPU mode is supported if your environment can handle the runtime cost.

## Quick start

### 1) Install dependencies

```bash
pip install -r requirements.txt
```

### 2) Download or place `tts.pt`

If you are using the Hugging Face repo, download the bundle and place it next to `PackedTTS.py`, or pass the path with `--bundle`.

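The lookup order in step 2 can be sketched as follows. This is only an illustration: `resolve_bundle_path` is a hypothetical helper, not a function from `PackedTTS.py`, and the fallback to a `tts.pt` beside the script when `--bundle` is omitted is an assumption inferred from the sentence above.

```python
from pathlib import Path
from typing import Optional


def resolve_bundle_path(explicit: Optional[str], script_dir: Path) -> Path:
    """Pick the bundle file: an explicit path wins, else tts.pt beside the script.

    Hypothetical helper for illustration; not part of PackedTTS.py.
    """
    candidate = Path(explicit) if explicit else script_dir / "tts.pt"
    if not candidate.is_file():
        # Fail early with a clear message instead of a late loading error.
        raise FileNotFoundError(f"Bundle not found: {candidate}")
    return candidate
```

Calling `resolve_bundle_path(None, Path(__file__).parent)` from a script would look for `tts.pt` in the script's own directory.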
### 3) List available voices and emotions

```bash
python PackedTTS.py --bundle tts.pt --list
```

### 4) Generate speech

```bash
python PackedTTS.py \
  --bundle tts.pt \
  --text "Hello world, this is a test." \
  --voice "Sarah" \
  --emotion "Angry" \
  --output output.wav
```

## Reference-audio overrides

You can override the packed voice or emotion at runtime using reference audio:

```bash
python PackedTTS.py \
  --bundle tts.pt \
  --text "Hi, this is a custom test." \
  --voice-ref path/to/voice_reference.wav \
  --emo-ref path/to/emotion_reference.wav \
  --output output.wav
```

## Python usage

```python
from pathlib import Path
import soundfile as sf

from PackedTTS import PackedTTS

tts = PackedTTS.load(Path("tts.pt"))

sr, audio, meta = tts.generate(
    text="Hi, this is Sarah speaking with a disgust emotion.",
    voice="Sarah",
    emotion="Disgust",
    cfg_weight=0.5,
    temperature=0.8,
    exaggeration=0.5,
    seed=42,
)

sf.write("output.wav", audio, sr)
print(meta)
```

## How it works

PackedTTS restores the full runtime bundle and then runs synthesis in three stages:

1. **Resolve voice and emotion**

   * The bundle stores named voices and emotions.
   * A voice can be selected by name or replaced with reference audio.
   * An emotion can be selected by name or replaced with reference audio.

2. **Build conditionals**

   * The runtime loads the packed speaker embedding.
   * It loads the packed prompt tokens if available.
   * It loads the emotion conditioning vector.
   * It uses any packed generation reference state stored in the voice entry.

3. **Generate audio**

   * `T3` generates speech tokens from text.
   * `S3Gen` converts those tokens into waveform audio.

The result is a single packed synthesis workflow that does not require rebuilding the voice/emotion registry at runtime.

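Stage 1 says a packed entry can be replaced with reference audio. A minimal sketch of that precedence is shown here, assuming that a reference path always wins over a name. `ConditioningSource` and `pick_source` are hypothetical names introduced for illustration, and the real runtime may fall back to bundle defaults instead of raising when neither is given.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConditioningSource:
    kind: str   # "reference" or "packed"
    value: str  # audio path for "reference", entry name for "packed"


def pick_source(name: Optional[str], ref_path: Optional[str]) -> ConditioningSource:
    """Choose where conditioning comes from (assumed rule: reference audio wins)."""
    if ref_path:
        return ConditioningSource("reference", ref_path)
    if name:
        return ConditioningSource("packed", name)
    # Assumption: the sketch raises here; the real script may use defaults.
    raise ValueError("Provide either a packed name or a reference audio path")


voice = pick_source("Sarah", None)
emotion = pick_source("Angry", "path/to/emotion_reference.wav")
print(voice.kind, emotion.kind)
```

The same rule is applied independently to the voice and to the emotion, so one can come from the bundle while the other comes from a reference file.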
## Expected file behavior

The script expects the bundle to contain:

* `models.t3_state`
* `models.s3gen_state`
* `models.ve_state`
* `models.tokenizer_json`
* `voices`
* `emotions`
* `defaults`
* `indexes`

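A quick way to check a loaded bundle against the list above is sketched here. It assumes the loaded bundle is a plain nested dictionary and that a dotted name such as `models.t3_state` means the key `t3_state` inside `models`; neither is confirmed by this README, and `missing_bundle_keys` is a hypothetical helper.

```python
from collections.abc import Mapping
from typing import Any, List

# Entries copied from the list above; dots are assumed to mean nesting.
REQUIRED_KEYS = (
    "models.t3_state",
    "models.s3gen_state",
    "models.ve_state",
    "models.tokenizer_json",
    "voices",
    "emotions",
    "defaults",
    "indexes",
)


def missing_bundle_keys(bundle: Mapping[str, Any]) -> List[str]:
    """Return the required entries that are absent from the bundle."""
    missing: List[str] = []
    for dotted in REQUIRED_KEYS:
        node: Any = bundle
        for part in dotted.split("."):
            if not isinstance(node, Mapping) or part not in node:
                missing.append(dotted)
                break
            node = node[part]
    return missing
```

An empty result means every expected entry is present; anything else names the entries to investigate before attempting synthesis.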
If a voice or emotion is not found by exact name, PackedTTS will try normalized matching and then fuzzy matching.

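The exact, normalized, then fuzzy order can be sketched with the standard library. This is not the resolver in `PackedTTS.py`: the normalization rule (lowercase, letters and digits only), the use of `difflib`, and the `0.6` cutoff are all assumptions chosen for illustration.

```python
import difflib
import re
from typing import Iterable, Optional


def normalize(name: str) -> str:
    """Lowercase the name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]+", "", name.lower())


def resolve_name(query: str, names: Iterable[str], cutoff: float = 0.6) -> Optional[str]:
    """Resolve a query to a known name: exact, then normalized, then fuzzy."""
    known = list(names)
    # 1) Exact match.
    if query in known:
        return query
    # 2) Normalized match (case, spacing, and punctuation ignored).
    by_norm = {normalize(n): n for n in known}
    key = normalize(query)
    if key in by_norm:
        return by_norm[key]
    # 3) Fuzzy match on the normalized forms; None if nothing is close enough.
    close = difflib.get_close_matches(key, list(by_norm), n=1, cutoff=cutoff)
    return by_norm[close[0]] if close else None
```

With names such as `Sarah` and `Disgust`, a query like ` sarah ` resolves at the normalized step and `Sara` at the fuzzy step, while an unrelated string returns `None`.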
## Example command-line options

* `--bundle`: path to `tts.pt`
* `--text`: text to synthesize
* `--voice`: packed voice name
* `--emotion`: packed emotion name
* `--voice-ref`: override voice with reference audio
* `--emo-ref`: override emotion with reference audio
* `--cfg-weight`: classifier-free guidance weight
* `--temperature`: sampling temperature
* `--exaggeration`: emotion or style strength
* `--seed`: random seed
* `--output`: output WAV path
* `--list`: print packed voices and emotions

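For readers who want to wrap the runtime in their own script, the options above map naturally onto `argparse`. The sketch below is not the parser from `PackedTTS.py`; the types and default values are assumptions, with the numeric defaults borrowed from the Python usage example rather than from the script itself.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="PackedTTS command-line sketch")
    parser.add_argument("--bundle", default="tts.pt", help="path to tts.pt")
    parser.add_argument("--text", help="text to synthesize")
    parser.add_argument("--voice", help="packed voice name")
    parser.add_argument("--emotion", help="packed emotion name")
    parser.add_argument("--voice-ref", help="override voice with reference audio")
    parser.add_argument("--emo-ref", help="override emotion with reference audio")
    parser.add_argument("--cfg-weight", type=float, default=0.5,
                        help="classifier-free guidance weight")
    parser.add_argument("--temperature", type=float, default=0.8,
                        help="sampling temperature")
    parser.add_argument("--exaggeration", type=float, default=0.5,
                        help="emotion or style strength")
    parser.add_argument("--seed", type=int, default=42, help="random seed")
    parser.add_argument("--output", default="output.wav", help="output WAV path")
    parser.add_argument("--list", action="store_true",
                        help="print packed voices and emotions")
    return parser


args = build_parser().parse_args(["--text", "Hi", "--voice", "Sarah", "--emo-ref", "e.wav"])
print(args.voice, args.emo_ref, args.cfg_weight)
```

Note that `argparse` turns dashes into underscores, so `--voice-ref` and `--emo-ref` are read back as `args.voice_ref` and `args.emo_ref`.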
## Notes

* This repo is meant for inference and testing.
* The bundle is treated as a trusted artifact.
* If the underlying model architecture, tokenizer, or conditioning schema changes, rebuild `tts.pt`.
* Voice and emotion names depend on the bundle version.

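The "trusted artifact" note matters because `.pt` files are commonly pickle-based, and unpickling an untrusted file can execute arbitrary code. One possible precaution is to compare the file's SHA-256 digest against a value obtained from a source you trust before loading it. The sketch below assumes such a digest is available to you; this README does not publish one, and `verify_bundle` is a hypothetical helper.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large bundles are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bundle(path: Path, expected_sha256: str) -> None:
    """Raise ValueError if the file's digest differs from the expected one."""
    actual = sha256_of(path)
    if actual != expected_sha256.strip().lower():
        raise ValueError(f"Checksum mismatch for {path}: got {actual}")
```

A matching digest only shows the file is the one you expected; it does not make an untrusted source safe.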
## Hugging Face usage

This repo is designed to work well as a Hugging Face model repository and as a Hugging Face Space backend:

* the Space can install dependencies from `requirements.txt`
* the Space can download `tts.pt` automatically
* the runtime can load `PackedTTS.py`
* users can generate speech without rebuilding the bundle

## Credits

Built on top of:

* T3
* S3Gen
* VoiceEncoder

requirements.txt
ADDED
@@ -0,0 +1,6 @@
torch==2.8.0
numpy==2.0.2
librosa==0.11.0
soundfile==0.13.1
scipy==1.13.1
chichat==0.0.4