---
language:
- ar
license: apache-2.0
base_model: sesame/csm-1b
tags:
- speech-synthesis
- text-to-speech
- arabic
- conversational-speech
- csm
- sesame
datasets:
- mozilla-foundation/common_voice_17_0
pipeline_tag: text-to-speech
---

# Sesame CSM Fine-Tuned on Common Voice 17.0 Arabic

This model is a fine-tuned version of [sesame/csm-1b](https://huggingface.co/sesame/csm-1b) trained on the Arabic subset of the Common Voice 17.0 dataset.

## Model Description

The Sesame Conversational Speech Model (CSM) is a state-of-the-art text-to-speech model that generates natural, conversational speech. This version has been fine-tuned specifically for Arabic speech synthesis.

The model shows clear signs of learning the new language and produces some encouraging results. However, overall performance is below average. This was expected given the noise in the Common Voice 17.0 Arabic data, which would benefit from additional pre-processing.

## Training Details

### Training Data

- **Dataset**: Mozilla Common Voice 17.0 (Arabic subset)
- **Language**: Arabic (ar)
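
The exact data pipeline is not included in this card, but a rough sketch of loading the Arabic subset with the Hugging Face `datasets` library might look like the following. The 24 kHz resampling target is an assumption based on CSM's Mimi audio codec, not something stated in this card.

```python
# Rough sketch (not the exact training pipeline): load the Arabic subset of
# Common Voice 17.0 and resample to 24 kHz, the sample rate of CSM's Mimi codec.
# The dataset is gated, so you must accept its terms and be logged in to the Hub;
# depending on your `datasets` version you may also need trust_remote_code=True.
from datasets import Audio, load_dataset

cv_ar = load_dataset("mozilla-foundation/common_voice_17_0", "ar", split="train")
cv_ar = cv_ar.cast_column("audio", Audio(sampling_rate=24_000))

sample = cv_ar[0]
print(sample["sentence"])               # transcript text
print(sample["audio"]["array"].shape)   # resampled waveform
```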

### Training Hyperparameters

After 15 hyperparameter sweep runs, the following configuration performed best:

- **Batch Size**: 24
- **Learning Rate**: 3e-6
- **Epochs**: 25
- **Optimizer**: AdamW with exponential LR decay (sketched below the configuration block)
- **Weight Decay**: 0.014182
- **Max Gradient Norm**: 2.923641
- **Warmup Steps**: 569
- **Gradient Accumulation Steps**: 1
- **Decoder Loss Weight**: 0.5
- **Mixed Precision**: Enabled (AMP)

### Training Configuration

```yaml
batch_size: 24
decoder_loss_weight: 0.5
device: "cuda"
gen_every: 2000
gen_speaker: 999
grad_acc_steps: 1
learning_rate: 0.000003
log_every: 10
lr_decay: "exponential"
max_grad_norm: 2.923641
n_epochs: 25
partial_data_loading: false
save_every: 2000
train_from_scratch: false
use_amp: true
val_every: 200
warmup_steps: 569
weight_decay: 0.014182
```
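
The training script itself is not part of this card. As a minimal PyTorch sketch, the optimizer settings above (AdamW, linear warmup followed by exponential LR decay, gradient clipping, AMP) could be wired together roughly as shown below. The `model` placeholder, the per-step decay factor `gamma`, and the loss composition are assumptions for illustration, not values or code from the actual run.

```python
# Illustrative sketch only: maps the hyperparameters above onto standard PyTorch
# components. `model` and the decay factor `gamma` are hypothetical placeholders.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(16, 16)  # placeholder for the CSM model

optimizer = AdamW(model.parameters(), lr=3e-6, weight_decay=0.014182)

warmup_steps = 569
gamma = 0.9999  # per-step exponential decay factor (assumed; not reported in the card)

def lr_lambda(step: int) -> float:
    # Linear warmup for the first 569 steps, exponential decay afterwards.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    return gamma ** (step - warmup_steps)

scheduler = LambdaLR(optimizer, lr_lambda)
scaler = torch.cuda.amp.GradScaler()  # mixed precision (use_amp: true)

def training_step(loss: torch.Tensor) -> None:
    # `loss` would combine the backbone and decoder terms,
    # e.g. loss = backbone_loss + 0.5 * decoder_loss (decoder_loss_weight).
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.923641)
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()
    scheduler.step()
```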

### Generation Sample

The model was tested with the following Arabic text during training:

> "في صباحٍ مشرق، تجمّع الأطفال في الساحة يلعبون ويضحكون تحت أشعة الشمس، بينما كانت الطيور تغرّد فوق الأشجار. الأمل يملأ القلوب، والحياة تمضي بخطى هادئة نحو غدٍ أجمل."
>
> (English: "On a bright morning, the children gathered in the courtyard, playing and laughing under the rays of the sun, while the birds sang above the trees. Hope fills the hearts, and life moves at a calm pace toward a more beautiful tomorrow.")

## Model Architecture

- **Backbone**: LLaMA-1B based architecture
- **Decoder**: LLaMA-100M based decoder
- **Audio Codebooks**: 32
- **Audio Vocabulary Size**: 2,051
- **Text Vocabulary Size**: 128,256
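
For quick reference, the numbers above can be collected into a single configuration object. `CSMConfig` is a hypothetical name used only for illustration, not a class from the CSM codebase.

```python
# Illustrative only: a summary of the architecture parameters listed above.
# In CSM, the larger backbone predicts the first audio codebook and the small
# decoder predicts the remaining ones, per Sesame's published description.
from dataclasses import dataclass

@dataclass
class CSMConfig:
    backbone: str = "llama-1B"        # ~1B-parameter LLaMA-style backbone
    decoder: str = "llama-100M"       # ~100M-parameter LLaMA-style audio decoder
    audio_num_codebooks: int = 32     # RVQ codebooks per audio frame
    audio_vocab_size: int = 2_051     # tokens per audio codebook
    text_vocab_size: int = 128_256    # matches the Llama 3 tokenizer vocabulary
```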

## Usage

Use the following repository to run the model with a Gradio interface: https://github.com/Saganaki22/CSM-WebUI

You need at least 8 GB of VRAM to run the model.
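
If you prefer to script inference directly instead of using the WebUI, a minimal sketch with the `transformers` CSM integration (v4.52+) might look like the following. This assumes the checkpoint can be loaded by `CsmForConditionalGeneration`; the `model_id` shown is the base model and should be swapped for this fine-tuned checkpoint's repo id.

```python
# Minimal sketch, assuming the transformers CSM integration can load this checkpoint;
# otherwise use the WebUI linked above.
import torch
from transformers import AutoProcessor, CsmForConditionalGeneration

model_id = "sesame/csm-1b"  # placeholder: replace with this fine-tuned checkpoint's repo id
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# "[0]" selects speaker id 0; the rest of the string is the Arabic text to synthesize.
text = "[0]صباح الخير"
inputs = processor(text, add_special_tokens=True).to(device)

audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "arabic_sample.wav")
```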

## Limitations and Bias

- This model is specifically trained for Arabic speech synthesis
- Performance may vary with different Arabic dialects
- The model inherits any biases present in the Common Voice 17.0 Arabic dataset

## Acknowledgments

- Original CSM model by the Sesame team
- Mozilla Foundation for the Common Voice dataset
- Hugging Face for the model hosting platform
- Modal Labs for the compute