tarun7r committed
Commit efedef6 · verified · 1 Parent(s): cf42aee

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ figures/Fig1.png filter=lfs diff=lfs merge=lfs -text
+ demo.wav filter=lfs diff=lfs merge=lfs -text
+ hi-Priya_woman.wav filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,207 @@
+ ---
+ license: mit
+ language:
+ - hi
+ - en
+ base_model: vibevoice/VibeVoice-7B
+ tags:
+ - text-to-speech
+ - hindi
+ - lora
+ - peft
+ - audio-generation
+ - tts
+ pipeline_tag: text-to-speech
+ ---
+
+ # VibeVoice-Hindi-LoRA
+
+ ## Model Description
+
+ This repository contains **LoRA (Low-Rank Adaptation) weights** for fine-tuning the VibeVoice-7B model for **Hindi text-to-speech synthesis**. These adapter weights enable efficient fine-tuning of the base VibeVoice model to generate high-quality, natural-sounding Hindi speech without requiring a full retrain.
+
+ LoRA is a parameter-efficient fine-tuning technique that adds trainable rank-decomposition matrices to the model while keeping the original pre-trained weights frozen. This yields far smaller checkpoints and faster training than full fine-tuning.
+
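+ As a rough illustration of the LoRA mechanism (a minimal sketch using the generic PEFT API; the stand-in base-model ID and the local adapter path are assumptions, and real VibeVoice inference goes through the community scripts described under Usage):
+
+ ```python
+ # Sketch: attach LoRA adapters to a frozen base model with PEFT.
+ # NOTE: AutoModelForCausalLM is only a stand-in; VibeVoice adds tokenizers
+ # and a diffusion head that the community inference scripts wire up.
+ from transformers import AutoModelForCausalLM
+ from peft import PeftModel
+
+ base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")      # frozen base weights
+ model = PeftModel.from_pretrained(base, "./vibevoice-hindi-lora")   # adds rank-16 adapters
+ model = model.merge_and_unload()  # optionally fold adapters into the base weights
+ ```
+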
+ ### Base Model
+ - **Base Model:** [vibevoice/VibeVoice-7B](https://huggingface.co/vibevoice/VibeVoice-7B)
+ - **Architecture:** Qwen2.5-7B + Acoustic/Semantic Tokenizers + Diffusion Head
+ - **Original Training:** English and Chinese speech synthesis
+
+ ### Fine-tuning Details
+ - **Target Language:** Hindi
+ - **Method:** LoRA (Low-Rank Adaptation)
+ - **Fine-tuned Components:**
+   - LLM backbone (LoRA adapters)
+   - Diffusion head (full fine-tuning)
+   - Acoustic and Semantic connectors
+
+ ## What's Included
+
+ This repository contains:
+ - `adapter_config.json` - LoRA configuration file
+ - `adapter_model.safetensors` - LoRA adapter weights for the LLM backbone
+ - `diffusion_head/` - Full diffusion head weights fine-tuned for Hindi
+ - `acoustic_connector/` - Acoustic connector weights
+ - `semantic_connector/` - Semantic connector weights
+
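+ To use the local-checkpoint commands below, these files can be fetched in one call (a minimal sketch; the local directory name is only chosen to match the `--checkpoint_path` used later):
+
+ ```python
+ # Download every file in this repo (LoRA adapters, diffusion head, connectors)
+ # into a local folder that --checkpoint_path can point at.
+ from huggingface_hub import snapshot_download
+
+ snapshot_download(
+     repo_id="tarun7r/vibevoice-hindi-lora",
+     local_dir="./vibevoice-hindi-lora",
+ )
+ ```
+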
+ ## Usage
+
+ ### Demo and Inference Code
+
+ For complete inference examples and demos, please refer to:
+
+ - **Community Repository:** [vibevoice-community/VibeVoice](https://github.com/vibevoice-community/VibeVoice)
+ - **ComfyUI Integration:** [Enemyx-net/VibeVoice-ComfyUI](https://github.com/Enemyx-net/VibeVoice-ComfyUI)
+
+ ### Hindi Inference
+
+ #### Setup (VibeVoice inference pipeline)
+
+ ```bash
+ # Clone the community repository
+ git clone https://github.com/vibevoice-community/VibeVoice.git
+ cd VibeVoice
+
+ # Install dependencies
+ uv pip install -e .
+ ```
+
+ #### With voice cloning (recommended for Hindi models):
+ ```bash
+ python demo/inference_from_file.py \
+     --checkpoint_path "./vibevoice-hindi-lora" \
+     --txt_path "./example_hindi_script.txt" \
+     --model_path "vibevoice/VibeVoice-7B" \
+     --speaker_names hi-Priya_woman \
+     --seed 42
+ ```
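+
+ The `--txt_path` file holds the transcript. A plausible `example_hindi_script.txt` is sketched below (hypothetical content; it assumes the `Speaker N:` line format used by the community demo scripts):
+
+ ```text
+ Speaker 1: नमस्ते! VibeVoice हिंदी मॉडल में आपका स्वागत है।
+ Speaker 1: यह एक छोटा परीक्षण वाक्य है।
+ ```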
+
+ #### With multiple speakers:
+ ```bash
+ python demo/inference_from_file.py \
+     --checkpoint_path "./vibevoice-hindi-lora" \
+     --txt_path "./example_hindi_script.txt" \
+     --model_path "vibevoice/VibeVoice-7B" \
+     --speaker_names "Speaker1" "Speaker2" \
+     --cfg_scale 1.3
+ ```
+
+ **Note**: For voice cloning, ensure the corresponding voice files are in the `demo/voices/` directory; the script automatically maps speaker names to voice files.
+
+ **Key points for Hindi inference:**
+ - **With voice cloning**: specify `--speaker_names` to map speakers to voice files
+ - Use `--model_path "vibevoice/VibeVoice-7B"` to match your checkpoint size
+ - The provided voice samples are loaded and used for voice cloning during generation
+
+ **Voice Cloning Setup** (see the sketch after this list for how name-to-file matching might work):
+ - Place voice sample files in the `demo/voices/` directory
+ - **Required file**: `hi-Priya_woman.wav` - Hindi female voice sample
+ - Use descriptive filenames like `hindi-speaker1.wav`, `hindi-speaker2.wav`
+ - Voice cloning works best with high-quality, clear voice samples
+
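+ How speaker names might resolve to files (an illustrative sketch only; the hypothetical `resolve_voice` helper is not part of the community repo, whose actual logic lives in `demo/inference_from_file.py`):
+
+ ```python
+ # Hypothetical sketch of --speaker_names -> demo/voices/*.wav matching.
+ from pathlib import Path
+
+ def resolve_voice(speaker_name: str, voices_dir: str = "demo/voices") -> Path:
+     """Return the first voice file whose stem contains the speaker name."""
+     for wav in sorted(Path(voices_dir).glob("*.wav")):
+         if speaker_name.lower() in wav.stem.lower():
+             return wav
+     raise FileNotFoundError(f"No voice sample found for {speaker_name!r}")
+
+ print(resolve_voice("hi-Priya_woman"))  # -> demo/voices/hi-Priya_woman.wav
+ ```
+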
+ **Model Architecture Compatibility:**
+ - Ensure your checkpoint matches the model size (a 7B checkpoint requires `--model_path "vibevoice/VibeVoice-7B"`)
+
+ ## Hindi Inference with Gradio Demo
+
+ For interactive Hindi speech generation with a web interface:
+
+ ### Launch the Gradio Demo:
+ ```bash
+ python demo/gradio_demo.py \
+     --model_path "vibevoice/VibeVoice-7B" \
+     --checkpoint_path "./vibevoice-hindi-lora" \
+     --device cuda
+ ```
+
+ ### Using the Web Interface:
+ 1. Enter your Hindi script in the text area
+ 2. Select speakers (use `hi-Priya_woman` for the Hindi voice)
+ 3. Click "🚀 Generate Podcast"
+
+ **Key points:**
+ - The provided voice samples are loaded and used for voice cloning during generation
+ - Real-time streaming audio generation is supported
+ - Works with both 1.5B and 7B models (ensure the checkpoint matches the model size)
+ - Make sure `hi-Priya_woman.wav` is in the `demo/voices/` directory
+
+ ## Demo
+
+ ### Sample Output:
+ <audio controls src="https://huggingface.co/tarun7r/vibevoice-hindi-lora/resolve/main/demo.wav" style="width: 100%;"></audio>
+
+ **Important Note:** The quality of the generated audio depends heavily on the reference voice file you provide in the `demo/voices/` directory. For best results:
+ - Use high-quality, clear voice samples
+ - Ensure the reference voice matches the desired speaking style
+ - Longer reference samples (10-30 seconds) generally produce better results; a quick sanity check is sketched below
+ - The voice characteristics of the reference sample will be transferred to the generated speech
+
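+ A minimal way to verify a reference sample before generation (a sketch using the `soundfile` library; the `check_reference` helper and the 10-30 s bounds mirror the guidance above and are not part of any official tooling):
+
+ ```python
+ # Sanity-check a reference voice sample: duration, sample rate, channels.
+ import soundfile as sf
+
+ def check_reference(path: str, min_s: float = 10.0, max_s: float = 30.0) -> None:
+     audio, sr = sf.read(path)
+     duration = len(audio) / sr
+     channels = 1 if audio.ndim == 1 else audio.shape[1]
+     print(f"{path}: {duration:.1f}s @ {sr} Hz, {channels} channel(s)")
+     if not (min_s <= duration <= max_s):
+         print("Warning: samples of 10-30 seconds generally clone best.")
+
+ check_reference("demo/voices/hi-Priya_woman.wav")
+ ```
+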
+ ## Model Capabilities
+
+ - **Text-to-Speech:** Convert Hindi text to natural-sounding speech
+ - **Multi-speaker Support:** Generate speech with multiple distinct speakers
+ - **Long-form Audio:** Synthesize extended audio sequences
+ - **Expressive Speech:** Maintain natural prosody and intonation for Hindi
+
+ ## Responsible Usage
+
+ ### Direct intended uses
+ The VibeVoice model is limited to research use, exploring highly realistic audio dialogue generation as detailed in the tech report.
+
+ ### Out-of-scope uses
+ Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the MIT License. Use to generate any text transcript. Furthermore, this release is not intended or licensed for any of the following scenarios:
+
+ - Voice impersonation without explicit, recorded consent – cloning a real individual's voice for satire, advertising, ransom, social engineering, or authentication bypass.
+ - Disinformation or impersonation – creating audio presented as genuine recordings of real people or events.
+ - Real-time or low-latency voice conversion – telephone or video-conference "live deep-fake" applications.
+ - Unsupported languages – the base model is trained only on English and Chinese data, and this adapter adds Hindi; outputs in other languages are unsupported and may be unintelligible or offensive.
+ - Generation of background ambience, Foley, or music – VibeVoice is speech-only and will not produce coherent non-speech audio.
+
+ ## Risks and limitations
+
+ While efforts have been made to optimize it through various techniques, the model may still produce outputs that are unexpected, biased, or inaccurate, and it inherits any biases, errors, or omissions of its base model.
+
+ - **Deepfakes and disinformation:** High-quality synthetic speech can be misused to create convincing fake audio for impersonation, fraud, or spreading disinformation. Users must ensure transcripts are reliable, check content accuracy, and avoid using generated content in misleading ways. Users are expected to deploy the models lawfully, in full compliance with all applicable laws and regulations in the relevant jurisdictions, and it is best practice to disclose the use of AI when sharing AI-generated content.
+ - **Language coverage:** The base model supports only English and Chinese; this adapter adds Hindi. Transcripts in other languages may result in unexpected audio outputs.
+ - **Non-speech audio:** The model focuses solely on speech synthesis and does not handle background noise, music, or other sound effects.
+ - **Overlapping speech:** The current model does not explicitly model or generate overlapping speech segments in conversations.
+
+ ## Recommendations
+
+ We do not recommend using VibeVoice in commercial or real-world applications without further testing and development. This model is intended for research and development purposes only. Please use responsibly.
+
+ To mitigate the risks of misuse, we have:
+ - Embedded an audible disclaimer (e.g. "This segment was generated by AI") automatically into every synthesized audio file.
+ - Added an imperceptible watermark to generated audio so third parties can verify VibeVoice provenance. Please see the contact information at the end of this model card.
+ - Logged inference requests (hashed) for abuse-pattern detection, with aggregated statistics published quarterly.
+
+ Users are responsible for sourcing their datasets legally and ethically. This may include securing appropriate rights and/or anonymizing data prior to use with VibeVoice. Users are reminded to be mindful of data privacy concerns.
+
+ ## License & Redistribution Notice
+
+ This model is released under the **MIT License**, consistent with the base VibeVoice model.
+
+ **Redistribution Notice:**
+ This repository contains model weights derived from [microsoft/VibeVoice-Large](https://www.modelscope.cn/models/microsoft/VibeVoice-Large), which is licensed under the MIT License. The MIT License permits redistribution and derivative works.
+
+ My understanding of the MIT License, which is consistent with the broader open-source community's consensus, is that it grants the right to distribute copies of the software and its derivatives. Therefore, I am lawfully exercising the right to redistribute this model.
+
+ If you are a rights holder and believe this understanding of the license is incorrect, please submit a DMCA complaint to Hugging Face at [email protected]
+
+ ## Acknowledgments
+
+ - **Base Model:** Microsoft Research for the original VibeVoice model
+ - **Fine-tuning Code:** [vibevoice-community/VibeVoice](https://github.com/vibevoice-community/VibeVoice) for the training framework
+ - **Training Infrastructure:** [Nebius](https://lightning.ai/pricing/) H100 GPU cluster
+ - **Community:** Hugging Face and the open-source AI community
+ - **Framework:** Built on Qwen2.5, Transformers, and PEFT libraries
+
+ ## Contact
+
+ **Actively seeking opportunities as an ML Engineer II / Data Scientist II**
+
+ For questions, issues, or collaboration:
+ - **GitHub:** [tarun7r](https://github.com/tarun7r)
+ - **Hugging Face:** [tarun7r](https://huggingface.co/tarun7r)
+ - **Base model contact:** [email protected]
+
+ ### Other Key Projects
+ - **[SpeechAlgo](https://github.com/tarun7r/SpeechAlgo)** - Comprehensive Speech Processing Algorithms Library
+ - **[Vocal-Agent](https://github.com/tarun7r/Vocal-Agent)** - Cascading voice assistant with real-time speech recognition
+ - **[Finance-Llama-8B](https://huggingface.co/tarun7r/Finance-Llama-8B)** - Financial domain fine-tuned Llama model
+
+ ---
+
+ **Note:** This is a research model. Please use responsibly and in compliance with applicable laws and ethical guidelines.
acoustic_connector/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3339274373c7b1becac4dad05ab36fd4119bc742d9dfe3773215e9840a68c2fa
+ size 26173211
adapter_config.json ADDED
@@ -0,0 +1,42 @@
+ {
+   "alpha_pattern": {},
+   "auto_mapping": null,
+   "base_model_name_or_path": "",
+   "bias": "none",
+   "corda_config": null,
+   "eva_config": null,
+   "exclude_modules": null,
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "layer_replication": null,
+   "layers_pattern": null,
+   "layers_to_transform": null,
+   "loftq_config": {},
+   "lora_alpha": 32,
+   "lora_bias": false,
+   "lora_dropout": 0.05,
+   "megatron_config": null,
+   "megatron_core": "megatron.core",
+   "modules_to_save": null,
+   "peft_type": "LORA",
+   "qalora_group_size": 16,
+   "r": 16,
+   "rank_pattern": {},
+   "revision": null,
+   "target_modules": [
+     "gate_proj",
+     "v_proj",
+     "k_proj",
+     "o_proj",
+     "q_proj",
+     "down_proj",
+     "up_proj"
+   ],
+   "target_parameters": null,
+   "task_type": "CAUSAL_LM",
+   "trainable_token_indices": null,
+   "use_dora": false,
+   "use_qalora": false,
+   "use_rslora": false
+ }
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9a3f1fb3e3e9929b90e80c412e8279825597dd688b86b5ce1bb9fd4609d87d10
+ size 161530840
demo.wav ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:50c26c8b6532558303ebb3c39d8ae96bf65b575b666147b0e0b73a0ee96d956b
+ size 2361644
diffusion_head/config.json ADDED
@@ -0,0 +1,20 @@
+ {
+   "architectures": [
+     "VibeVoiceDiffusionHead"
+   ],
+   "ddpm_batch_mul": 4,
+   "ddpm_beta_schedule": "cosine",
+   "ddpm_num_inference_steps": 20,
+   "ddpm_num_steps": 1000,
+   "diffusion_type": "ddpm",
+   "head_ffn_ratio": 3.0,
+   "head_layers": 4,
+   "hidden_size": 3584,
+   "latent_size": 64,
+   "model_type": "vibevoice_diffusion_head",
+   "prediction_type": "v_prediction",
+   "rms_norm_eps": 1e-05,
+   "speech_vae_dim": 64,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.51.3"
+ }
diffusion_head/diffusion_head_full.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cc3d8f80c59ee7e89bc18751bbf124a2f46c79d040c97c3edcb8fb7aa28432ab
+ size 1338678485
diffusion_head/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d3f155080fead8b2a06852f3016280e80191651b779bb54b15b53c8e1d897084
+ size 1338669752
diffusion_head_full.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cc3d8f80c59ee7e89bc18751bbf124a2f46c79d040c97c3edcb8fb7aa28432ab
+ size 1338678485
hi-Priya_woman.wav ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:343395d24039183143b0ab79530d4f62ff87f46ec9aec72fed5164155321a73b
+ size 218046
semantic_connector/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cdfd59ae14d127eb099f57e78b6e5a475668672038a184195b41846ae3817ca3
+ size 26631963