whisper-small-swh-finetuned
This model is a fine-tuned version of OpenAI's Whisper-small specifically for Swahili Automatic Speech Recognition (ASR). It was fine-tuned on the Swahili portion of the FLEURS-SLU dataset to significantly improve transcription accuracy for the Swahili language.
- Developed by: Daniel Amemba Odhiambo
- Model type: Whisper-small for Automatic Speech Recognition
- Language: Swahili (swh, sw, Aswh)
- License: MIT
Performance Highlights
The fine-tuned model reduces Word Error Rate (WER) on the Swahili test set from 103.10% to 32.07%, a relative improvement of about 68.9% over the base model.
| Model | WER (%) |
|---|---|
| openai/whisper-small (baseline) | 103.10 |
| adoamesh/whisper-small-swh-finetuned (this model) | 32.07 |
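The relative improvement follows directly from the two WER values in the table. A minimal sketch of the arithmetic (the variable names are illustrative, not taken from the training code):

```python
# WER values as reported in the table above (percent).
baseline_wer = 103.10
finetuned_wer = 32.07

# Relative improvement = share of the baseline error that fine-tuning removed.
relative_improvement = (baseline_wer - finetuned_wer) / baseline_wer * 100

print(f"Relative WER reduction: {relative_improvement:.1f}%")
```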
Model Description
This model is based on the openai/whisper-small architecture, a Transformer-based encoder-decoder. It was adapted to Swahili by fine-tuning the pre-trained checkpoint on Swahili audio data.
- Base Model: openai/whisper-small
- Fine-tuned for: 1000 steps
- Training Data: Swahili split of the FLEURS-SLU dataset (3.62 hours training, 0.60 hours testing)
Uses
Direct Use
This model is intended for transcribing Swahili speech to text. Primary use cases include:
- Transcription of Swahili audio and video content.
- Building speech-to-text applications for Swahili.
- Research in low-resource language speech recognition.
Downstream Use
The model can be fine-tuned further for specific domains or accents within Swahili.
Out-of-Scope Use
- Not recommended for real-time, production-critical systems without further evaluation and testing.
- Not recommended for transcribing other languages.
- Must not be used for surveillance, discriminatory purposes, or any application that violates human rights.
How to Get Started with the Model
Usage: Comparison with OpenAI Whisper-small
Tested on Windows 11. The script below transcribes the same audio file with the base Whisper-small model and with the fine-tuned model:
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration
# --- 1️⃣ Load processor (shared between models) ---
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
# --- 2️⃣ Load models ---
# Fine-tuned Swahili Whisper-Small from Hugging Face Hub
finetuned_model = WhisperForConditionalGeneration.from_pretrained("adoamesh/whisper-small-swh-finetuned")
finetuned_model.generation_config.language = "swahili"
finetuned_model.generation_config.task = "transcribe"
finetuned_model.eval()
# Original OpenAI Whisper-Small
original_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
original_model.generation_config.language = "swahili"
original_model.generation_config.task = "transcribe"
original_model.eval()
# --- 3️⃣ Load audio file using librosa ---
audio_path = "Recording.wav"
try:
    waveform, sample_rate = librosa.load(audio_path, sr=16000)
    waveform = torch.from_numpy(waveform).float().unsqueeze(0)
except Exception as e:
    print(f"Error loading audio with librosa: {e}")
    exit(1)
# Prepare input features
input_features = processor(
    waveform.numpy()[0],
    sampling_rate=16000,
    return_tensors="pt"
).input_features
# --- 4️⃣ Transcribe ---
with torch.no_grad():
    # Fine-tuned
    predicted_ids_finetuned = finetuned_model.generate(input_features)
    transcription_finetuned = processor.batch_decode(predicted_ids_finetuned, skip_special_tokens=True)[0]
    # Original
    predicted_ids_original = original_model.generate(input_features)
    transcription_original = processor.batch_decode(predicted_ids_original, skip_special_tokens=True)[0]
# --- 5️⃣ Print results ---
print("\n=== Transcriptions ===")
print(f"Fine-tuned Swahili Whisper-Small: {transcription_finetuned}")
print(f"Original Whisper-Small (OpenAI): {transcription_original}")
Example output of the comparison script:
Fine-tuned Swahili Whisper-Small: Ni limwambia yulebindi kwamba na mpenda.
Original Whisper-Small (OpenAI): Nili mwabi ayule bindi kwa mba nampe dha.
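For recordings longer than Whisper's 30-second input window, the Transformers `pipeline` helper may be more convenient than calling `generate` directly. The following is a minimal sketch rather than a tested recipe for this model: the function name `transcribe_swahili` is hypothetical, the 30-second chunk length is an assumed setting, and decoding a file path this way relies on ffmpeg being installed.

```python
def transcribe_swahili(audio_path, model_id="adoamesh/whisper-small-swh-finetuned"):
    """Transcribe one audio file with the ASR pipeline (illustrative helper)."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,  # assumption: split long audio into 30-second chunks
    )
    # Force Swahili transcription, mirroring the generation_config used above.
    result = asr(
        audio_path,
        generate_kwargs={"language": "swahili", "task": "transcribe"},
    )
    return result["text"]


# Example call (downloads the model on first use):
# print(transcribe_swahili("Recording.wav"))
```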
Training Details
Training Data
The model was fine-tuned on the Swahili (sw) split of the FLEURS-SLU dataset.
| Dataset Split | Duration (hours) |
|---|---|
| Training | 3.62 |
| Test | 0.60 |
| Total | 4.22 |
Training Procedure
- Steps: 1000
- Learning Rate: 3e-5
- Batch Size: 16
- Warmup Steps: 200
- Gradient Checkpointing: Enabled
- Precision: FP16
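The card does not say which training loop was used. If the Hugging Face `Seq2SeqTrainer` were used, the hyperparameters listed above could map onto `Seq2SeqTrainingArguments` roughly as follows. This is a sketch under that assumption: the helper name, the output directory, and the `predict_with_generate` setting are illustrative and not taken from the original training script.

```python
def build_training_args(output_dir="./whisper-small-swh"):
    """Return training arguments mirroring the listed hyperparameters (illustrative)."""
    # Imported lazily so that defining this helper needs no heavy dependencies.
    import torch
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(
        output_dir=output_dir,              # hypothetical path
        max_steps=1000,                     # Steps: 1000
        learning_rate=3e-5,                 # Learning Rate: 3e-5
        per_device_train_batch_size=16,     # Batch Size: 16 (assumes a single device)
        warmup_steps=200,                   # Warmup Steps: 200
        gradient_checkpointing=True,        # Gradient Checkpointing: Enabled
        fp16=torch.cuda.is_available(),     # FP16 only works on a CUDA device
        predict_with_generate=True,         # assumption: generate text at eval time for WER
    )
```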
Training Results
| Metric | Value |
|---|---|
| Final Training Loss | 0.0007 |
| Validation Loss | 0.8038 |
| Final WER | 32.07% |
| Training Runtime | ~2.3 hours |
| Total Epochs | 19.6 |
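As a rough sanity check, the reported step count, batch size, and epoch count can be related to each other. The sketch below assumes the batch size of 16 is the effective batch size (single device, no gradient accumulation), which the card does not state, so the derived figures are estimates only.

```python
# Values reported in this card.
max_steps = 1000
batch_size = 16          # assumed to be the effective batch size
reported_epochs = 19.6
train_hours = 3.62

# Total examples processed over the whole run.
samples_seen = max_steps * batch_size

# Implied size of the training set and average clip length (estimates).
approx_train_examples = samples_seen / reported_epochs
avg_clip_seconds = train_hours * 3600 / approx_train_examples

print(f"Examples processed:        {samples_seen}")
print(f"Implied training examples: {approx_train_examples:.0f}")
print(f"Implied average clip:      {avg_clip_seconds:.1f} s")
```

Under these assumptions each training example was seen about 20 times.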
Evaluation
Results vs. Baseline
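The primary evaluation metric is Word Error Rate (WER) on the Swahili test set. Fine-tuning lowers WER from 103.10% to 32.07%. A WER above 100% is possible because inserted words count as errors, so a hypothesis can contain more errors than the reference has words.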
| Model | WER (%) | Relative Improvement |
|---|---|---|
| openai/whisper-small | 103.10 | - |
| adoamesh/whisper-small-swh-finetuned | 32.07 | 68.9% |
Example Transcription
Audio: "Nilimwambia yule bibi kwamba nampenda." (I told that lady that I love her.)
| Model | Transcription |
|---|---|
| Fine-tuned Model | Ni limwambia yulebindi kwamba na mpenda. |
| Base Model | Nili mwabi aiyule bindi kwa mba nampe dha. |
The fine-tuned model's output is closer to the reference than the base model's, although it still contains word-segmentation errors.
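To make the metric concrete, the sketch below computes a per-utterance WER for the example above using a word-level edit distance. It is an illustration only: the text normalization (lowercasing and dropping punctuation) is an assumption, and the card's reported figures are corpus-level numbers that may have been computed with a different tool and normalization.

```python
import re


def words(text):
    """Lowercase and split into word tokens, dropping punctuation (assumed normalization)."""
    return re.findall(r"\w+", text.lower())


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = words(reference), words(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "Nilimwambia yule bibi kwamba nampenda."
finetuned = "Ni limwambia yulebindi kwamba na mpenda."
base = "Nili mwabi ayule bindi kwa mba nampe dha."

wer_finetuned = wer(reference, finetuned)
wer_base = wer(reference, base)

print(f"Fine-tuned: {wer_finetuned:.0%}")  # 100%
print(f"Base:       {wer_base:.0%}")       # 160%
```

Word-level WER penalizes wrongly split or merged words heavily, so on this single sentence the fine-tuned output still scores 100% even though it is much closer to the reference at the character level. The base model's score of 160% shows how insertions push WER past 100%.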
Limitations and Bias
- Dialectal Bias: Performance may vary across different Swahili dialects and accents not well-represented in the training data.
- Domain Specificity: The model may struggle with technical terminology, slang, or domain-specific vocabulary (e.g., medical, legal).
- Audio Quality: Performance will degrade with poor-quality, noisy, or low-fidelity audio recordings.
- Data Scale: The model was trained on a relatively small dataset (3.62 hours of training audio; 4.22 hours including the test split). Performance can likely be improved with more diverse and extensive Swahili audio data.
Environmental Impact
- Hardware Type: 1 x NVIDIA GPU (exact type not specified)
- Hours used: ~1.3 hours
Citation
If you use this model, please cite both the original Whisper paper and this model.
@misc{odhiambo2025whispersmallswahili,
author = {Odhiambo, Daniel Amemba},
title = {Whisper Small Swahili Fine-tuned},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/adoamesh/whisper-small-swh-finetuned}}
}
@misc{radford2022whisper,
title = {Robust Speech Recognition via Large-Scale Weak Supervision},
author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
year = {2022},
eprint = {2212.04356},
archivePrefix = {arXiv},
primaryClass = {eess.AS}
}
Acknowledgements
- OpenAI for the original Whisper model and architecture.
- The creators of the FLEURS-SLU dataset for providing the Swahili speech data.
- Hugging Face for the Transformers library and for hosting the model.
This model card was authored by Daniel Amemba Odhiambo and redesigned with the help of an AI assistant.