# W2V-BERT 2.0 Multilingual ASR (6 Languages, 128-dim Adapters)
This model is a fine-tuned version of facebook/w2v-bert-2.0 for multilingual automatic speech recognition across 6 languages.
## Supported Languages
| Language | Code | Token |
|---|---|---|
| English | eng | `<eng_Latn>` |
| Swahili | swh | `<swh_Latn>` |
| Kikuyu | kik | `<kik_Latn>` |
| Kamba | kam | `<kam_Latn>` |
| Kimeru | mer | `<mer_Latn>` |
| Luo | luo | `<luo_Latn>` |
## Model Description
- Architecture: V3 Hybrid (MMS-style Adapters + Decoder Block)
- Base Model: facebook/w2v-bert-2.0
- Task: Multilingual Automatic Speech Recognition (ASR)
- Training: CTC loss with language token prefixes
## Language Token Approach
This model uses language identification tokens prepended to transcripts:
- Training: transcripts are prefixed with a language token, e.g. `<eng_Latn> hello world`, `<swh_Latn> habari dunia`
- Inference: the model predicts the language token first, providing automatic language identification
Note: This model identifies the primary language per utterance. It does not support code-switching (multiple languages within a single utterance).
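A minimal sketch of how transcripts could be prefixed during data preparation (the token map mirrors the table above; the `lang` and `text` column names are illustrative, not the actual dataset schema):

```python
# Illustrative preprocessing sketch; the exact training pipeline is not published here.
LANG_TOKENS = {
    "eng": "<eng_Latn>", "swh": "<swh_Latn>", "kik": "<kik_Latn>",
    "kam": "<kam_Latn>", "mer": "<mer_Latn>", "luo": "<luo_Latn>",
}

def add_language_token(example):
    # `lang` and `text` are hypothetical column names in the training dataset
    token = LANG_TOKENS[example["lang"]]
    example["text"] = f"{token} {example['text']}"
    return example
```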
## Evaluation Results
### User-Facing WER (Recommended)
This is the WER you can expect when running inference on individual samples. It reflects real-world usage patterns.
| Language | WER |
|---|---|
| English (eng) | 14.56% |
| Swahili (swh) | 24.10% |
| Kikuyu (kik) | 15.63% |
| Kamba (kam) | 32.73% |
| Kimeru (mer) | 33.94% |
| Luo (luo) | 16.86% |
| Overall | 21.36% |
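Per-sample WER of this kind can be estimated with a generic word-error-rate routine, for example via the `jiwer` package (a sketch with toy data, not the exact evaluation script used for this table):

```python
import jiwer

# Toy example: strip the language token from each hypothesis before scoring
references = ["habari dunia", "hello world"]
hypotheses = ["<swh_Latn> habari dunia", "<eng_Latn> hello word"]
hypotheses = [h.split(">", 1)[-1].strip() if h.startswith("<") else h for h in hypotheses]

print(f"WER: {jiwer.wer(references, hypotheses):.2%}")
```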
### Training Configuration
| Metric | Value |
|---|---|
| Training Samples | 90,000 |
| Samples per Language | 15,000 |
| Epochs | 10 |
### Training Evaluation WER (for reference)
WER computed during training on batched evaluation; it may differ from the user-facing WER above due to batch-processing effects.
| Language | WER |
|---|---|
| English (eng) | 21.80% |
| Swahili (swh) | 21.22% |
| Kikuyu (kik) | 22.08% |
| Kamba (kam) | 20.81% |
| Kimeru (mer) | 20.88% |
| Luo (luo) | 21.43% |
| Overall | 21.37% |
## Architecture Details
| Component | Value |
|---|---|
| Adapter Dimension | 128 |
| Decoder Layers | 1 |
| Vocabulary Size | 43 |
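The vocabulary size (and overall parameter count) can be sanity-checked from the loaded checkpoint; note that the 128-dim adapters and decoder block belong to the custom V3 Hybrid head, so they are not necessarily visible as plain config fields:

```python
from transformers import Wav2Vec2BertForCTC

model = Wav2Vec2BertForCTC.from_pretrained("mutisya/w2v-bert-v3Hybrid-6lang-e10-v4")
print(model.config.vocab_size)  # expected: 43
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```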
## Usage
```python
from transformers import Wav2Vec2BertProcessor, Wav2Vec2BertForCTC
import torch

# Load model and processor
processor = Wav2Vec2BertProcessor.from_pretrained("mutisya/w2v-bert-v3Hybrid-6lang-e10-v4")
model = Wav2Vec2BertForCTC.from_pretrained("mutisya/w2v-bert-v3Hybrid-6lang-e10-v4")

# Transcribe audio (language-agnostic mode)
def transcribe(audio_array, sampling_rate=16000):
    inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)[0]
    # Result will be like: "<eng_Latn> hello world" or "<swh_Latn> habari dunia"
    return transcription

# To remove the language token prefix:
def transcribe_clean(audio_array, sampling_rate=16000):
    text = transcribe(audio_array, sampling_rate)
    # Strip the leading "<xxx_Latn>" language token
    if text.startswith("<") and ">" in text:
        text = text.split(">", 1)[-1].strip()
    return text
```
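For example, an utterance can be loaded with librosa (an optional dependency, resampled to 16 kHz; the file path below is a placeholder) and passed to the functions above:

```python
import librosa

# Load any speech recording and resample to 16 kHz (path is a placeholder)
audio_array, _ = librosa.load("sample.wav", sr=16000)

print(transcribe(audio_array))        # e.g. "<swh_Latn> habari dunia"
print(transcribe_clean(audio_array))  # e.g. "habari dunia"
```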
## Training Details
- Base model: facebook/w2v-bert-2.0
- Learning rate: 5e-4 with cosine decay
- Batch size: 8
- Warmup ratio: 0.1
- Weight decay: 0.01
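A rough sketch of how these hyperparameters map onto `transformers.TrainingArguments` (the actual training script, data collator, and V3 Hybrid head wiring are not reproduced here):

```python
from transformers import TrainingArguments

# Sketch only: mirrors the hyperparameters listed above, not the original script.
training_args = TrainingArguments(
    output_dir="w2v-bert-v3Hybrid-6lang",  # placeholder output path
    per_device_train_batch_size=8,
    learning_rate=5e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    num_train_epochs=10,
)
```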
## Known Limitations
- Kamba and Kimeru show higher WER than the other languages; this may be due to orthographic inconsistencies in the test data
- No code-switching support: the model identifies one language per utterance
- Language tokens are NOT special tokens, so they appear in the decoded output; use post-processing (e.g. `transcribe_clean` above) to remove them if needed
## License
Apache 2.0