---
language:
- ar
license: apache-2.0
base_model: sesame/csm-1b
tags:
- speech-synthesis
- text-to-speech
- arabic
- conversational-speech
- csm
- sesame
datasets:
- mozilla-foundation/common_voice_17_0
pipeline_tag: text-to-speech
---

# Sesame CSM Fine-Tuned on Common Voice 17.0 Arabic

This model is a fine-tuned version of [sesame/csm-1b](https://huggingface.co/sesame/csm-1b) trained on the Arabic subset of the Common Voice 17.0 dataset.

## Model Description

The Sesame Conversational Speech Model (CSM) is a state-of-the-art text-to-speech model that generates natural, conversational speech. This version has been fine-tuned specifically for Arabic speech synthesis.

The model shows clear signs of learning the new language and produces some encouraging results. However, overall performance is below average. This was expected given the noise in the Common Voice 17.0 Arabic data, which would benefit from additional pre-processing.

## Training Details

### Training Data

- **Dataset**: Mozilla Common Voice 17.0 (Arabic subset)
- **Language**: Arabic (ar)
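
The exact data pipeline is not included in this card, but a rough sketch of loading the Arabic subset with the Hugging Face `datasets` library might look like the following. The 24 kHz resampling target is an assumption based on CSM's Mimi audio codec, not something stated in this card.

```python
# Rough sketch (not the exact training pipeline): load the Arabic subset of
# Common Voice 17.0 and resample to 24 kHz, the sample rate of CSM's Mimi codec.
# The dataset is gated, so you must accept its terms and be logged in to the Hub;
# depending on your `datasets` version you may also need trust_remote_code=True.
from datasets import Audio, load_dataset

cv_ar = load_dataset("mozilla-foundation/common_voice_17_0", "ar", split="train")
cv_ar = cv_ar.cast_column("audio", Audio(sampling_rate=24_000))

sample = cv_ar[0]
print(sample["sentence"])               # transcript text
print(sample["audio"]["array"].shape)   # resampled waveform
```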

### Training Hyperparameters

After 15 hyperparameter sweep runs, the following configuration performed best:

- **Batch Size**: 24
- **Learning Rate**: 3e-6
- **Epochs**: 25
- **Optimizer**: AdamW with exponential LR decay (sketched below the configuration block)
- **Weight Decay**: 0.014182
- **Max Gradient Norm**: 2.923641
- **Warmup Steps**: 569
- **Gradient Accumulation Steps**: 1
- **Decoder Loss Weight**: 0.5
- **Mixed Precision**: Enabled (AMP)

### Training Configuration

```yaml
batch_size: 24
decoder_loss_weight: 0.5
device: "cuda"
gen_every: 2000
gen_speaker: 999
grad_acc_steps: 1
learning_rate: 0.000003
log_every: 10
lr_decay: "exponential"
max_grad_norm: 2.923641
n_epochs: 25
partial_data_loading: false
save_every: 2000
train_from_scratch: false
use_amp: true
val_every: 200
warmup_steps: 569
weight_decay: 0.014182
```
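
The training script itself is not part of this card. As a minimal PyTorch sketch, the optimizer settings above (AdamW, linear warmup followed by exponential LR decay, gradient clipping, AMP) could be wired together roughly as shown below. The `model` placeholder, the per-step decay factor `gamma`, and the loss composition are assumptions for illustration, not values or code from the actual run.

```python
# Illustrative sketch only: maps the hyperparameters above onto standard PyTorch
# components. `model` and the decay factor `gamma` are hypothetical placeholders.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(16, 16)  # placeholder for the CSM model

optimizer = AdamW(model.parameters(), lr=3e-6, weight_decay=0.014182)

warmup_steps = 569
gamma = 0.9999  # per-step exponential decay factor (assumed; not reported in the card)

def lr_lambda(step: int) -> float:
    # Linear warmup for the first 569 steps, exponential decay afterwards.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    return gamma ** (step - warmup_steps)

scheduler = LambdaLR(optimizer, lr_lambda)
scaler = torch.cuda.amp.GradScaler()  # mixed precision (use_amp: true)

def training_step(loss: torch.Tensor) -> None:
    # `loss` would combine the backbone and decoder terms,
    # e.g. loss = backbone_loss + 0.5 * decoder_loss (decoder_loss_weight).
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.923641)
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()
    scheduler.step()
```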

### Generation Sample

The model was tested with the following Arabic text during training:

> "في صباحٍ مشرق، تجمّع الأطفال في الساحة يلعبون ويضحكون تحت أشعة الشمس، بينما كانت الطيور تغرّد فوق الأشجار. الأمل يملأ القلوب، والحياة تمضي بخطى هادئة نحو غدٍ أجمل."
>
> (English: "On a bright morning, the children gathered in the courtyard, playing and laughing under the rays of the sun, while the birds sang above the trees. Hope fills the hearts, and life moves at a calm pace toward a more beautiful tomorrow.")

## Model Architecture

- **Backbone**: LLaMA-1B based architecture
- **Decoder**: LLaMA-100M based decoder
- **Audio Codebooks**: 32
- **Audio Vocabulary Size**: 2,051
- **Text Vocabulary Size**: 128,256
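
For quick reference, the numbers above can be collected into a single configuration object. `CSMConfig` is a hypothetical name used only for illustration, not a class from the CSM codebase.

```python
# Illustrative only: a summary of the architecture parameters listed above.
# In CSM, the larger backbone predicts the first audio codebook and the small
# decoder predicts the remaining ones, per Sesame's published description.
from dataclasses import dataclass

@dataclass
class CSMConfig:
    backbone: str = "llama-1B"        # ~1B-parameter LLaMA-style backbone
    decoder: str = "llama-100M"       # ~100M-parameter LLaMA-style audio decoder
    audio_num_codebooks: int = 32     # RVQ codebooks per audio frame
    audio_vocab_size: int = 2_051     # tokens per audio codebook
    text_vocab_size: int = 128_256    # matches the Llama 3 tokenizer vocabulary
```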

## Usage

Use the following repository to run the model with a Gradio interface: https://github.com/Saganaki22/CSM-WebUI

You need at least 8 GB of VRAM to run the model.
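
If you prefer to script inference directly instead of using the WebUI, a minimal sketch with the `transformers` CSM integration (v4.52+) might look like the following. This assumes the checkpoint can be loaded by `CsmForConditionalGeneration`; the `model_id` shown is the base model and should be swapped for this fine-tuned checkpoint's repo id.

```python
# Minimal sketch, assuming the transformers CSM integration can load this checkpoint;
# otherwise use the WebUI linked above.
import torch
from transformers import AutoProcessor, CsmForConditionalGeneration

model_id = "sesame/csm-1b"  # placeholder: replace with this fine-tuned checkpoint's repo id
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# "[0]" selects speaker id 0; the rest of the string is the Arabic text to synthesize.
text = "[0]صباح الخير"
inputs = processor(text, add_special_tokens=True).to(device)

audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "arabic_sample.wav")
```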

## Limitations and Bias

- This model is specifically trained for Arabic speech synthesis
- Performance may vary with different Arabic dialects
- The model inherits any biases present in the Common Voice 17.0 Arabic dataset

## Acknowledgments

- Original CSM model by the Sesame team
- Mozilla Foundation for the Common Voice dataset
- Hugging Face for the model hosting platform
- Modal Labs for the compute