W2V-BERT 2.0 Multilingual ASR (6 Languages, 128-dim Adapters)

This model is a fine-tuned version of facebook/w2v-bert-2.0 for multilingual automatic speech recognition across 6 languages.

Supported Languages

Language   Code   Token
English    eng    <eng_Latn>
Swahili    swh    <swh_Latn>
Kikuyu     kik    <kik_Latn>
Kamba      kam    <kam_Latn>
Kimeru     mer    <mer_Latn>
Luo        luo    <luo_Latn>

Model Description

  • Architecture: V3 Hybrid (MMS-style Adapters + Decoder Block)
  • Base Model: facebook/w2v-bert-2.0
  • Task: Multilingual Automatic Speech Recognition (ASR)
  • Training: CTC loss with language token prefixes

Language Token Approach

This model uses language identification tokens prepended to transcripts:

  • Training: "<eng_Latn> hello world", "<swh_Latn> habari dunia"
  • Inference: Model predicts language token first, providing automatic language identification

Note: This model identifies the primary language per utterance. It does not support code-switching (multiple languages within a single utterance).
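
The snippet below is a minimal sketch of the prefixing scheme described above, assuming a dataset with "text" and "lang" fields; the field names and the token map are illustrative and not taken from the actual training code.

# Illustrative sketch only: the token map and field names are assumptions.
LANG_TOKENS = {
    "eng": "<eng_Latn>",
    "swh": "<swh_Latn>",
    "kik": "<kik_Latn>",
    "kam": "<kam_Latn>",
    "mer": "<mer_Latn>",
    "luo": "<luo_Latn>",
}

def add_language_prefix(example):
    # Prepend the language token so the CTC target reads e.g. "<swh_Latn> habari dunia"
    example["text"] = f'{LANG_TOKENS[example["lang"]]} {example["text"]}'
    return example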

Evaluation Results

User-Facing WER (Recommended)

This is the WER you can expect when running inference on individual samples. It reflects real-world usage patterns.

Language        WER
English (eng)   14.56%
Swahili (swh)   24.10%
Kikuyu (kik)    15.63%
Kamba (kam)     32.73%
Kimeru (mer)    33.94%
Luo (luo)       16.86%
Overall         21.36%
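
WER of this form is typically computed with a standard word-error-rate implementation; the snippet below is a minimal sketch using the evaluate library (an assumption, since the actual evaluation script is not included in this card), with language-token prefixes stripped before scoring.

import evaluate  # pip install evaluate jiwer

# Toy example; in practice, predictions/references are the per-language test transcripts
# with the language-token prefix removed first.
references = ["habari dunia", "hello world"]
predictions = ["habari dunia", "hello word"]

wer_metric = evaluate.load("wer")
print(f"WER: {wer_metric.compute(predictions=predictions, references=references):.2%}")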

Training Configuration

Metric                 Value
Training Samples       90,000
Samples per Language   15,000
Epochs                 10

Training Evaluation WER (for reference)

WER computed during training with batched evaluation; it may differ from the user-facing WER above due to batch-processing effects.

Language        WER
English (eng)   21.80%
Swahili (swh)   21.22%
Kikuyu (kik)    22.08%
Kamba (kam)     20.81%
Kimeru (mer)    20.88%
Luo (luo)       21.43%
Overall         21.37%

Architecture Details

Component           Value
Adapter Dimension   128
Decoder Layers      1
Vocabulary Size     43
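
The card does not include the V3 Hybrid implementation itself; the sketch below shows a generic MMS-style residual bottleneck adapter, purely to illustrate what "Adapter Dimension 128" refers to. The hidden size of 1024 matches w2v-bert-2.0, but the class is an assumption, not the model's actual code.

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Illustrative residual bottleneck adapter (not the model's actual V3 Hybrid module)."""

    def __init__(self, hidden_size: int = 1024, adapter_dim: int = 128):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.down = nn.Linear(hidden_size, adapter_dim)  # project down to the 128-dim bottleneck
        self.up = nn.Linear(adapter_dim, hidden_size)    # project back to the encoder width
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(self.norm(hidden_states))))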

Usage

from transformers import Wav2Vec2BertProcessor, Wav2Vec2BertForCTC
import torch

# Load model and processor
processor = Wav2Vec2BertProcessor.from_pretrained("mutisya/w2v-bert-v3Hybrid-6lang-e10-v4")
model = Wav2Vec2BertForCTC.from_pretrained("mutisya/w2v-bert-v3Hybrid-6lang-e10-v4")

# Transcribe audio; the language is identified automatically via the predicted token
def transcribe(audio_array, sampling_rate=16000):
    inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)[0]
    # Result will be like: "<eng_Latn> hello world" or "<swh_Latn> habari dunia"
    return transcription

# To remove language token prefix:
def transcribe_clean(audio_array, sampling_rate=16000):
    text = transcribe(audio_array, sampling_rate)
    # Remove language token prefix
    if text.startswith("<") and ">" in text:
        text = text.split(">", 1)[-1].strip()
    return text
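
A usage sketch follows, assuming librosa for loading and resampling; any loader that yields a 16 kHz mono float array works, and "sample.wav" is a placeholder path.

import re
import librosa

# Load a local file and resample to the 16 kHz mono input the processor expects.
audio_array, _ = librosa.load("sample.wav", sr=16000, mono=True)

raw = transcribe(audio_array)             # e.g. "<swh_Latn> habari dunia"
match = re.match(r"<(\w+)>\s*(.*)", raw)  # split the language token from the text
if match:
    lang_token, text = match.groups()
    print(f"language: {lang_token} | text: {text}")
else:
    print(raw)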

Training Details

  • Base model: facebook/w2v-bert-2.0
  • Learning rate: 5e-4 with cosine decay
  • Batch size: 8
  • Warmup ratio: 0.1
  • Weight decay: 0.01
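
The hyperparameters above map onto transformers.TrainingArguments roughly as sketched below; the output directory and any arguments not listed above are assumptions.

from transformers import TrainingArguments

# Sketch only: values come from the bullets above; everything else is a placeholder.
training_args = TrainingArguments(
    output_dir="./w2v-bert-v3Hybrid-6lang",  # placeholder
    per_device_train_batch_size=8,
    learning_rate=5e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    num_train_epochs=10,
)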

Known Limitations

  1. Kamba and Kimeru show higher WER than the other languages; this may be due to orthographic inconsistencies in the test data.
  2. No code-switching support: the model identifies one language per utterance.
  3. Language tokens are not special tokens, so they appear in the output; use post-processing (e.g., transcribe_clean above) to remove them if needed.

License

Apache 2.0
