HiMind committed on
Commit e2a5429 · verified · 1 Parent(s): 4c1b004

Upload 2 files

  1. README.md +204 -0
  2. requirements.txt +6 -0
README.md CHANGED
@@ -1,3 +1,207 @@
1
  ---
2
  license: apache-2.0
3
+ language:
4
+ - en
5
+ tags:
6
+ - text-to-speech
7
+ - tts
8
+ - voice-cloning
9
+ - emotion-control
10
+ - speech-synthesis
11
+ - packed-model
12
+ - pytorch
13
  ---
14
+
15
+ # PackedTTS
16
+
17
+ PackedTTS is a self-contained text-to-speech runtime bundle that packages the full synthesis stack into a single `tts.pt` file.
18
+
19
+ The bundle is designed to be loaded directly by the runtime script and used without rebuilding the model stack. It stores the model weights, tokenizer data, packed voices, packed emotions, resolution indexes, and runtime defaults in one artifact.
20
+
21
+ The example bundle in this repo is intended to be used as-is and currently includes a voice set and an emotion set.
22
+
23
+ ## What is included
24
+
25
+ `tts.pt` contains:
26
+
27
+ - T3 weights
28
+ - S3Gen weights
29
+ - VoiceEncoder weights
30
+ - tokenizer JSON
31
+ - packed voices
32
+ - packed emotions
33
+ - lookup indexes
34
+ - default resolution settings
35
+
36
+ This is not a training checkpoint meant to be unpacked and rebuilt from ingredients. It is the runtime artifact.
37
+
38
+ ## Repository contents
39
+
40
+ - `tts.pt`: packed TTS bundle
41
+ - `PackedTTS.py`: runtime loader, resolver, and inference script
42
+ - `requirements.txt`: Python dependencies for the runtime
43
+ - `README.md`: usage and overview
44
+
45
+ ## Features
46
+
47
+ - Single-file bundle loading
48
+ - Voice selection by name
49
+ - Emotion selection by name
50
+ - Fuzzy matching for names
51
+ - Optional reference-audio overrides
52
+ - Packed voices and emotions inside one artifact
53
+ - CLI usage for quick testing
54
+
55
+ ## Requirements
56
+
57
+ You will need:
58
+
59
+ - Python 3.10+
60
+ - A working PyTorch environment
61
+ - The dependencies listed in `requirements.txt`
62
+
63
+ A GPU is recommended, but CPU mode is supported if your environment can handle the runtime cost.
64
+
65
+ ## Quick start
66
+
67
+ ### 1) Install dependencies
68
+
69
+ ```bash
70
+ pip install -r requirements.txt
71
+ ```
72
+
73
+ ### 2) Download or place `tts.pt`
74
+
75
+ If you are using the Hugging Face repo, download the bundle and place it next to `PackedTTS.py`, or pass the path with `--bundle`.
76
+
77
+ ### 3) List available voices and emotions
78
+
79
+ ```bash
80
+ python PackedTTS.py --bundle tts.pt --list
81
+ ```
82
+
83
+ ### 4) Generate speech
84
+
85
+ ```bash
86
+ python PackedTTS.py \
87
+ --bundle tts.pt \
88
+ --text "Hello world, this is a test." \
89
+ --voice "Sarah" \
90
+ --emotion "Angry" \
91
+ --output output.wav
92
+ ```
93
+
94
+ ## Reference-audio overrides
95
+
96
+ You can override the packed voice or emotion at runtime using reference audio:
97
+
98
+ ```bash
99
+ python PackedTTS.py \
100
+ --bundle tts.pt \
101
+ --text "Hi, this is a custom test." \
102
+ --voice-ref path/to/voice_reference.wav \
103
+ --emo-ref path/to/emotion_reference.wav \
104
+ --output output.wav
105
+ ```
106
+
107
+ ## Python usage
108
+
109
+ ```python
110
+ from pathlib import Path
111
+ import soundfile as sf
112
+
113
+ from PackedTTS import PackedTTS
114
+
115
+ tts = PackedTTS.load(Path("tts.pt"))
116
+
117
+ sr, audio, meta = tts.generate(
118
+     text="Hi, this is Sarah speaking with a disgust emotion.",
119
+     voice="Sarah",
120
+     emotion="Disgust",
121
+     cfg_weight=0.5,
122
+     temperature=0.8,
123
+     exaggeration=0.5,
124
+     seed=42,
125
+ )
126
+
127
+ sf.write("output.wav", audio, sr)
128
+ print(meta)
129
+ ```
130
+
131
+ ## How it works
132
+
133
+ PackedTTS restores the full runtime bundle and then runs synthesis in three stages:
134
+
135
+ 1. **Resolve voice and emotion**
136
+
137
+ * The bundle stores named voices and emotions.
138
+ * A voice can be selected by name or replaced with reference audio.
139
+ * An emotion can be selected by name or replaced with reference audio.
140
+
141
+ 2. **Build conditionals**
142
+
143
+ * The runtime loads the packed speaker embedding.
144
+ * It loads the packed prompt tokens if available.
145
+ * It loads the emotion conditioning vector.
146
+ * It uses any packed generation reference state stored in the voice entry.
147
+
148
+ 3. **Generate audio**
149
+
150
+ * `T3` generates speech tokens from text.
151
+ * `S3Gen` converts those tokens into waveform audio.
152
+
153
+ The result is a single packed synthesis workflow that does not require rebuilding the voice/emotion registry at runtime.
154
+
155
+ ## Expected file behavior
156
+
157
+ The script expects the bundle to contain:
158
+
159
+ * `models.t3_state`
160
+ * `models.s3gen_state`
161
+ * `models.ve_state`
162
+ * `models.tokenizer_json`
163
+ * `voices`
164
+ * `emotions`
165
+ * `defaults`
166
+ * `indexes`
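
The list above can be turned into a quick sanity check before inference. The sketch below is not part of `PackedTTS.py`. It assumes the loaded bundle is a nested dictionary in which a dotted name such as `models.t3_state` means `bundle["models"]["t3_state"]`; that layout and the helper name `missing_bundle_keys` are hypothetical.

```python
from collections.abc import Mapping
from typing import Any, List

# Dotted key paths copied from the list above.
# Assumption: a dot separates nested dictionary levels.
EXPECTED_KEYS = (
    "models.t3_state",
    "models.s3gen_state",
    "models.ve_state",
    "models.tokenizer_json",
    "voices",
    "emotions",
    "defaults",
    "indexes",
)


def missing_bundle_keys(bundle: Mapping) -> List[str]:
    """Return the expected dotted paths that are absent from `bundle`."""
    missing = []
    for path in EXPECTED_KEYS:
        node: Any = bundle
        for part in path.split("."):
            # Walk one level deeper, or record the whole path as missing.
            if isinstance(node, Mapping) and part in node:
                node = node[part]
            else:
                missing.append(path)
                break
    return missing
```

Calling `missing_bundle_keys(bundle)` on a loaded bundle and failing early on a non-empty result tends to give a clearer error than a `KeyError` raised partway through synthesis.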
167
+
168
+ If a voice or emotion is not found by exact name, PackedTTS will try normalized matching and then fuzzy matching.
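
The exact, then normalized, then fuzzy lookup order described above could look roughly like the following. This is a minimal sketch using the standard library's `difflib`, not the resolver shipped in `PackedTTS.py`; the function names, the normalization rule (lowercase, alphanumerics only), and the `0.6` cutoff are all assumptions.

```python
import difflib
import re
from typing import Optional, Sequence


def normalize(name: str) -> str:
    """Lowercase the name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]+", "", name.lower())


def resolve_name(query: str, names: Sequence[str], cutoff: float = 0.6) -> Optional[str]:
    """Resolve `query` against `names`: exact, then normalized, then fuzzy."""
    # 1) Exact match.
    if query in names:
        return query
    # 2) Normalized match (case, spacing, and punctuation are ignored).
    by_norm = {normalize(n): n for n in names}
    key = normalize(query)
    if key in by_norm:
        return by_norm[key]
    # 3) Fuzzy match on the normalized forms; None if nothing is close enough.
    close = difflib.get_close_matches(key, list(by_norm), n=1, cutoff=cutoff)
    return by_norm[close[0]] if close else None
```

With names such as `["Sarah", "Angry", "Disgust"]`, a query like `" sarah "` would resolve in the normalized step and `"Sara"` in the fuzzy step. A stricter cutoff reduces surprising matches at the cost of rejecting more typos.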
169
+
170
+ ## Example command-line options
171
+
172
+ * `--bundle`: path to `tts.pt`
173
+ * `--text`: text to synthesize
174
+ * `--voice`: packed voice name
175
+ * `--emotion`: packed emotion name
176
+ * `--voice-ref`: override voice with reference audio
177
+ * `--emo-ref`: override emotion with reference audio
178
+ * `--cfg-weight`: classifier-free guidance weight
179
+ * `--temperature`: sampling temperature
180
+ * `--exaggeration`: emotion strength / style strength
181
+ * `--seed`: random seed
182
+ * `--output`: output WAV path
183
+ * `--list`: print packed voices and emotions
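
For readers wrapping the CLI in their own tooling, the options above map naturally onto an `argparse` parser. The sketch below only mirrors the flag names listed here; it is not the parser in `PackedTTS.py`, and the types and default values (borrowed from the Python usage example) are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="PackedTTS CLI sketch (hypothetical)")
    parser.add_argument("--bundle", default="tts.pt", help="path to tts.pt")
    parser.add_argument("--text", help="text to synthesize")
    parser.add_argument("--voice", help="packed voice name")
    parser.add_argument("--emotion", help="packed emotion name")
    parser.add_argument("--voice-ref", help="override voice with reference audio")
    parser.add_argument("--emo-ref", help="override emotion with reference audio")
    parser.add_argument("--cfg-weight", type=float, default=0.5,
                        help="classifier-free guidance weight")
    parser.add_argument("--temperature", type=float, default=0.8,
                        help="sampling temperature")
    parser.add_argument("--exaggeration", type=float, default=0.5,
                        help="emotion strength / style strength")
    parser.add_argument("--seed", type=int, default=None, help="random seed")
    parser.add_argument("--output", default="output.wav", help="output WAV path")
    parser.add_argument("--list", action="store_true",
                        help="print packed voices and emotions")
    return parser
```

`argparse` converts dashes in flag names to underscores, so `--voice-ref` is read back as `args.voice_ref` and `--cfg-weight` as `args.cfg_weight`.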
184
+
185
+ ## Notes
186
+
187
+ * This repo is meant for inference and testing.
188
+ * The bundle is treated as a trusted artifact.
189
+ * If the underlying model architecture, tokenizer, or conditioning schema changes, rebuild `tts.pt`.
190
+ * Voice and emotion names depend on the bundle version.
191
+
192
+ ## Hugging Face usage
193
+
194
+ This repo is designed to work well as a Hugging Face model repository and as a Hugging Face Space backend:
195
+
196
+ * the Space can install dependencies from `requirements.txt`
197
+ * the Space can download `tts.pt` automatically
198
+ * the runtime can load `PackedTTS.py`
199
+ * users can generate speech without rebuilding the bundle
200
+
201
+ ## Credits
202
+
203
+ Built on top of:
204
+
205
+ * T3
206
+ * S3Gen
207
+ * VoiceEncoder
requirements.txt ADDED
@@ -0,0 +1,6 @@
1
+ torch==2.8.0
2
+ numpy==2.0.2
3
+ librosa==0.11.0
4
+ soundfile==0.13.1
5
+ scipy==1.13.1
6
+ chichat==0.0.4