Add model card
#1 by gauthelo - opened
README.md CHANGED
@@ -1,3 +1,84 @@
---
license: cc-by-nc-4.0
metrics:
- cer
- wer
library_name: speechbrain
pipeline_tag: automatic-speech-recognition
tags:
- speech processing
- self-supervision
- african languages
- fine-tuning
---

## Model description
This self-supervised speech model (a.k.a. SSA-HuBERT-base-60k) is based on the HuBERT Base architecture (~95M parameters) [1].
It was trained on nearly 60,000 hours of speech segments and covers 21 languages and variants spoken in Sub-Saharan Africa.
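The pretrained encoder can be used as a frame-level feature extractor. The sketch below is a minimal illustration that assumes the checkpoint is available in the Hugging Face `transformers` format; the repository id is a placeholder, and loading details may differ for the released weights.

```python
# Minimal sketch: extract self-supervised features with the transformers HuBERT classes.
# Assumption: the checkpoint is published in transformers format; the repo id is a placeholder.
import numpy as np
import torch
from transformers import HubertModel, Wav2Vec2FeatureExtractor

repo_id = "your-org/SSA-HuBERT-base-60k"  # hypothetical identifier, replace with the actual repo id
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(repo_id)
model = HubertModel.from_pretrained(repo_id).eval()

waveform = np.zeros(16000, dtype=np.float32)  # 1 s of 16 kHz mono audio (placeholder)
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    features = model(inputs.input_values).last_hidden_state  # shape: (1, n_frames, 768)
```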

### Pretraining data
- Dataset: The training dataset was composed of both studio recordings (controlled environment, prepared talks) and street interviews (noisy environment, spontaneous speech).

- Languages: Bambara (bam), Dyula (dyu), French (fra), Fula (ful), Fulfulde (ffm), Fulfulde (fuh), Gulmancema (gux), Hausa (hau), Kinyarwanda (kin), Kituba (ktu), Lingala (lin), Luba-Lulua (lua), Mossi (mos), Maninkakan (mwk), Sango (sag), Songhai (son), Swahili (swc), Swahili (swh), Tamasheq (taq), Wolof (wol), Zarma (dje).

## ASR fine-tuning
The SpeechBrain toolkit (Ravanelli et al., 2021) is used to fine-tune the model.
Fine-tuning is done for each language using the FLEURS dataset [2].
The pretrained model (SSA-HuBERT-base-60k) is used as a speech encoder and is fully fine-tuned, with two 1024-unit linear layers and a softmax output on top.
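As a rough illustration of that head, the PyTorch sketch below stacks two 1024-unit linear layers on the encoder output and ends with a (log-)softmax over the output tokens; the activation function and the vocabulary size are illustrative assumptions rather than details taken from the released recipe.

```python
# Sketch of the ASR head described above: two 1024-unit linear layers on top of the
# SSA-HuBERT encoder output, followed by a softmax over the output tokens.
# The activation and the vocabulary size (n_tokens) are illustrative assumptions.
import torch
import torch.nn as nn

class ASRHead(nn.Module):
    def __init__(self, encoder_dim: int = 768, hidden_dim: int = 1024, n_tokens: int = 100):
        super().__init__()
        self.dnn = nn.Sequential(
            nn.Linear(encoder_dim, hidden_dim),
            nn.LeakyReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.LeakyReLU(),
        )
        self.out = nn.Linear(hidden_dim, n_tokens)

    def forward(self, encoder_out: torch.Tensor) -> torch.Tensor:
        # encoder_out: (batch, n_frames, encoder_dim), e.g. the encoder's last hidden state
        return torch.log_softmax(self.out(self.dnn(encoder_out)), dim=-1)
```

Taking the frame-wise argmax of these posteriors is how the greedy decoding mentioned in the Results section would typically proceed.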

## License
This model is released under the CC BY-NC 4.0 license.

## Publication
This model was presented at AfricaNLP 2024.
The associated paper is available here: [Africa-Centric Self-Supervised Pre-Training for Multilingual Speech Representation in a Sub-Saharan Context](https://openreview.net/forum?id=zLOhcft2E7)

### Citation
Please cite our paper when using the SSA-HuBERT-base-60k model:

Caubrière, A., & Gauthier, E. (2024). Africa-Centric Self-Supervised Pre-Training for Multilingual Speech Representation in a Sub-Saharan Context. In 5th Workshop on African Natural Language Processing (AfricaNLP 2024).

**Bibtex citation:**

    @inproceedings{caubriere2024ssaspeechssl,
      title={Africa-Centric Self-Supervised Pretraining for Multilingual Speech Representation in a Sub-Saharan Context},
      author={Antoine Caubri{\`e}re and Elodie Gauthier},
      booktitle={5th Workshop on African Natural Language Processing},
      year={2024},
      url={https://openreview.net/forum?id=zLOhcft2E7}
    }

## Results
The following results are obtained with greedy decoding (no language model rescoring).
Character error rates (CER) and word error rates (WER) on the 20 languages of the Sub-Saharan African subset of the FLEURS dataset are given in the table below.

| **Language**       | **CER (%)** | **WER (%)** |
| :----------------- | :---------- | :---------- |
| **Afrikaans**      | 23.3        | 68.4        |
| **Amharic**        | 15.9        | 52.7        |
| **Fula**           | 21.2        | 61.9        |
| **Ganda**          | 11.5        | 52.8        |
| **Hausa**          | 10.5        | 32.5        |
| **Igbo**           | 19.7        | 57.5        |
| **Kamba**          | 16.1        | 53.9        |
| **Lingala**        | 8.7         | 24.7        |
| **Luo**            | 9.9         | 38.9        |
| **Northern Sotho** | 13.5        | 43.2        |
| **Nyanja**         | 13.3        | 54.2        |
| **Oromo**          | 22.8        | 78.1        |
| **Shona**          | 11.6        | 50.2        |
| **Somali**         | 21.6        | 64.9        |
| **Swahili**        | 7.1         | 23.8        |
| **Umbundu**        | 21.7        | 61.7        |
| **Wolof**          | 19.4        | 55.0        |
| **Xhosa**          | 11.9        | 51.6        |
| **Yoruba**         | 24.3        | 67.5        |
| **Zulu**           | 12.2        | 53.4        |
| *Overall average*  | *15.8*      | *52.3*      |
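For reference, such scores can be recomputed from greedy transcripts and reference texts. The sketch below uses the `jiwer` package, which is an illustrative choice and not necessarily the scoring tool used in the paper; the transcripts are placeholders.

```python
# Sketch: compute WER and CER from greedy hypotheses and reference transcripts.
# The jiwer package is an illustrative choice; the paper's exact scoring setup may differ.
import jiwer

references = ["habari ya asubuhi", "karibu sana"]  # placeholder reference transcripts
hypotheses = ["habari ya asubui", "karibu sana"]   # placeholder greedy ASR outputs

wer = jiwer.wer(references, hypotheses)  # word error rate over the whole set
cer = jiwer.cer(references, hypotheses)  # character error rate over the whole set
print(f"WER: {100 * wer:.1f}%  CER: {100 * cer:.1f}%")
```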

## Reproducibility
We provide a notebook to reproduce the ASR experiments reported in our paper: see `SB_ASR_FLEURS_finetuning.ipynb`.
Using the `ASR_FLEURS-swahili_hf.yaml` config file, you can run the recipe on Swahili.
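The Swahili recipe fine-tunes on the FLEURS Swahili split, which can also be inspected directly from the Hugging Face Hub. In the sketch below, the `google/fleurs` dataset id, the `sw_ke` config name, and the column names refer to the public FLEURS release and are assumptions, not part of this repository.

```python
# Sketch: load the public FLEURS Swahili split used by the Swahili recipe.
# Dataset id, config name, and columns follow the public FLEURS release (assumptions).
from datasets import load_dataset

fleurs_sw = load_dataset("google/fleurs", "sw_ke", split="test")
sample = fleurs_sw[0]
print(sample["transcription"])            # reference text
print(sample["audio"]["sampling_rate"])   # expected to be 16000
```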

## References
[1] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021. doi: 10.1109/TASLP.2021.3122291.

[2] Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pp. 798–805, 2022. doi: 10.1109/SLT54892.2023.10023141.