whisper-small-swh-finetuned

This model is a fine-tuned version of OpenAI's Whisper-small specifically for Swahili Automatic Speech Recognition (ASR). It was fine-tuned on the Swahili portion of the FLEURS-SLU dataset to significantly improve transcription accuracy for the Swahili language.

  • Developed by: Daniel Amemba Odhiambo
  • Model type: Whisper-small for Automatic Speech Recognition
  • Language: Swahili (ISO codes: swh, sw)
  • License: MIT

Performance Highlights

The fine-tuned model reduces Word Error Rate (WER) by 68% relative to the base model on the Swahili test set.

Model                                               WER (%)
openai/whisper-small (baseline)                     103.10
adoamesh/whisper-small-swh-finetuned (this model)    32.07

Model Description

This model is based on the openai/whisper-small architecture, which features a Transformer-based encoder-decoder structure. It has been specifically adapted for Swahili by fine-tuning on Swahili audio data.

  • Base Model: openai/whisper-small
  • Fine-tuned for: 1000 steps
  • Training Data: Swahili split of the FLEURS-SLU dataset (3.62 hours training, 0.60 hours testing)
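
For orientation, the sketch below loads the checkpoint and prints a few architecture figures. The layer counts and hidden size in the comments are the standard whisper-small values, not numbers taken from this card.

from transformers import WhisperForConditionalGeneration

# Load the fine-tuned checkpoint and inspect its configuration.
model = WhisperForConditionalGeneration.from_pretrained("adoamesh/whisper-small-swh-finetuned")

print(model.config.encoder_layers, model.config.decoder_layers)  # whisper-small: 12 encoder / 12 decoder layers
print(model.config.d_model)                                      # whisper-small hidden size: 768
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")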

Uses

Direct Use

This model is intended for transcribing Swahili speech to text. Primary use cases include:

  • Transcription of Swahili audio and video content.
  • Building speech-to-text applications for Swahili.
  • Research in low-resource language speech recognition.
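
For quick experiments, single-model transcription can be done with the transformers pipeline API. This is a minimal sketch: the audio path is a placeholder, and passing language/task through generate_kwargs assumes a reasonably recent transformers release.

import torch
from transformers import pipeline

# Build an ASR pipeline around the fine-tuned checkpoint.
asr = pipeline(
    "automatic-speech-recognition",
    model="adoamesh/whisper-small-swh-finetuned",
    device=0 if torch.cuda.is_available() else -1,
)

# "Recording.wav" is a placeholder; any mono audio file works
# (the pipeline resamples to 16 kHz internally).
result = asr("Recording.wav", generate_kwargs={"language": "swahili", "task": "transcribe"})
print(result["text"])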

Downstream Use

The model can be fine-tuned further for specific domains or accents within Swahili (a typical training configuration is sketched under Training Procedure below).

Out-of-Scope Use

  • Not recommended for real-time, production-critical systems without further evaluation and testing.
  • Not recommended for transcribing other languages.
  • Must not be used for surveillance, discriminatory purposes, or any application that violates human rights.

How to Get Started with the Model

Usage: Comparison with the Original OpenAI Whisper-Small Model

The script below (tested on Windows 11) transcribes the same recording with both the fine-tuned model and the original openai/whisper-small, so the outputs can be compared directly.

import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# --- 1️⃣ Load processor (shared between models) ---
processor = WhisperProcessor.from_pretrained("openai/whisper-small")

# --- 2️⃣ Load models ---
# Fine-tuned Swahili Whisper-Small from Hugging Face Hub
finetuned_model = WhisperForConditionalGeneration.from_pretrained("adoamesh/whisper-small-swh-finetuned")
finetuned_model.generation_config.language = "swahili"
finetuned_model.generation_config.task = "transcribe"
finetuned_model.eval()

# Original OpenAI Whisper-Small
original_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
original_model.generation_config.language = "swahili"
original_model.generation_config.task = "transcribe"
original_model.eval()

# --- 3️⃣ Load audio file using librosa (resampled to Whisper's 16 kHz) ---
audio_path = "Recording.wav"
try:
    waveform, sample_rate = librosa.load(audio_path, sr=16000)
except Exception as e:
    print(f"Error loading audio with librosa: {e}")
    exit(1)

# Prepare log-Mel input features from the raw waveform
input_features = processor(
    waveform,
    sampling_rate=16000,
    return_tensors="pt"
).input_features

# --- 4️⃣ Transcribe ---
with torch.no_grad():
    # Fine-tuned
    predicted_ids_finetuned = finetuned_model.generate(input_features)
    transcription_finetuned = processor.batch_decode(predicted_ids_finetuned, skip_special_tokens=True)[0]

    # Original
    predicted_ids_original = original_model.generate(input_features)
    transcription_original = processor.batch_decode(predicted_ids_original, skip_special_tokens=True)[0]

# --- 5️⃣ Print results ---
print("\n=== Transcriptions ===")
print(f"Fine-tuned Swahili Whisper-Small: {transcription_finetuned}")
print(f"Original Whisper-Small (OpenAI): {transcription_original}")

Example output, comparing whisper-small with the fine-tuned model:

Fine-tuned Swahili Whisper-Small: Ni limwambia yulebindi kwamba na mpenda.
Original Whisper-Small (OpenAI):  Nili mwabi ayule bindi kwa mba nampe dha.

Training Details

Training Data

The model was fine-tuned on the Swahili (sw) split of the FLEURS-SLU dataset.

Dataset Split    Duration (hours)
Training         3.62
Test             0.60
Total            4.22
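
As an illustration of how such data can be loaded, the sketch below uses the base google/fleurs dataset (Swahili config "sw_ke") as a stand-in, since the exact FLEURS-SLU hub identifier is not given here; depending on your datasets version, script-based datasets may additionally require trust_remote_code=True.

from datasets import Audio, load_dataset

# Load the Swahili split of base FLEURS as a stand-in for FLEURS-SLU.
fleurs_sw = load_dataset("google/fleurs", "sw_ke", split="train")

# Whisper expects 16 kHz audio.
fleurs_sw = fleurs_sw.cast_column("audio", Audio(sampling_rate=16_000))

sample = fleurs_sw[0]
print(sample["transcription"])
print(sample["audio"]["array"].shape, sample["audio"]["sampling_rate"])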

Training Procedure

  • Steps: 1000
  • Learning Rate: 3e-5
  • Batch Size: 16
  • Warmup Steps: 200
  • Gradient Checkpointing: Enabled
  • Precision: FP16
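
Expressed as Hugging Face Seq2SeqTrainingArguments, the settings above would look roughly like this sketch (output_dir and predict_with_generate are assumptions; the exact training script is not published):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-swh-finetuned",  # placeholder path
    max_steps=1000,
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    warmup_steps=200,
    gradient_checkpointing=True,
    fp16=True,
    predict_with_generate=True,  # assumed; standard for Whisper evaluation
)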

Training Results

Metric                 Value
Final Training Loss    0.0007
Validation Loss        0.8038
Final WER              32.07%
Training Runtime       ~2.3 hours
Total Epochs           19.6

Evaluation

Results vs. Baseline

The primary metric for evaluation is Word Error Rate (WER) on the Swahili test set. The fine-tuned model shows a dramatic improvement.

Model                                   WER (%)   Relative WER Reduction
openai/whisper-small                    103.10    –
adoamesh/whisper-small-swh-finetuned     32.07    68%
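
The WER numbers above can be reproduced with the evaluate library's "wer" metric. The strings below are illustrative placeholders taken from the example transcription further down, not the full test set:

import evaluate

wer_metric = evaluate.load("wer")

# Illustrative reference/hypothesis pair; real evaluation runs over the
# whole Swahili test split.
references = ["nilimwambia yule bibi kwamba nampenda"]
predictions = ["ni limwambia yulebindi kwamba na mpenda"]

wer = 100 * wer_metric.compute(references=references, predictions=predictions)
print(f"WER: {wer:.2f}%")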

Example Transcription

Audio: "Nilimwambia yule bibi kwamba nampenda." (I told that lady that I love her.)

Model              Transcription
Fine-tuned Model   Ni limwambia yulebindi kwamba na mpenda.
Base Model         Nili mwabi aiyule bindi kwa mba nampe dha.

The fine-tuned model produces a more accurate and natural-sounding transcription.

Limitations and Bias

  • Dialectal Bias: Performance may vary across different Swahili dialects and accents not well-represented in the training data.
  • Domain Specificity: The model may struggle with technical terminology, slang, or domain-specific vocabulary (e.g., medical, legal).
  • Audio Quality: Performance will degrade with poor-quality, noisy, or low-fidelity audio recordings.
  • Data Scale: The model was trained on a relatively small dataset (4.22 hours). Performance can likely be improved with more diverse and extensive Swahili audio data.

Environmental Impact

  • Hardware Type: 1 x NVIDIA GPU (exact type not specified)
  • Hours used: ~1.3 hours

Citation

If you use this model, please cite both the original Whisper paper and this model.

@misc{odhiambo2025whispersmallswahili,
  author = {Odhiambo, Daniel Amemba},
  title = {Whisper Small Swahili Fine-tuned},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/adoamesh/whisper-small-swh-finetuned}}
}

@misc{radford2022whisper,
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  year = {2022},
  eprint = {2212.04356},
  archivePrefix = {arXiv},
  primaryClass = {eess.AS}
}

Acknowledgements


This model card was authored by Daniel Amemba Odhiambo and redesigned with the help of an AI assistant.
