# Whisper Small Khmer ASR
A fine-tuned variant of openai/whisper-small for Khmer automatic speech recognition. The model was trained with the utilities in `whisper` and is intended for transcription workloads that depend on Khmer text normalization, including numerals, currency, and date expressions (see the sketch below).
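The actual normalization rules live in khmerspeech and khmercut and are not reproduced here. The snippet below is only a minimal sketch of the kind of rule-based digit handling such a pipeline performs; the function name and rules are hypothetical, not the training pipeline.

```python
import re

# Khmer digits ០-៩ (U+17E0..U+17E9) map one-to-one onto ASCII 0-9.
KHMER_DIGITS = str.maketrans("០១២៣៤៥៦៧៨៩", "0123456789")

def normalize_khmer_text(text: str) -> str:
    """Toy normalizer (hypothetical): unify digits, collapse whitespace.

    Stands in for the khmerspeech/khmercut rules referenced in this card;
    it is not the actual normalization used for training.
    """
    text = text.translate(KHMER_DIGITS)        # e.g. "១២៣" -> "123"
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated spaces

print(normalize_khmer_text("តម្លៃ ១២៣ រៀល"))  # -> "តម្លៃ 123 រៀល"
```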
## Model Card
| Attribute | Value |
|---|---|
| Base model | openai/whisper-small |
| Language | Khmer (km-KH) |
| Task | Automatic Speech Recognition (speech-to-text) |
| Sample rate | 16 kHz audio, automatically resampled |
| Input length | Up to 30 s clips (truncated during batching) |
| Finetuning data | asr_mixed_dataset.txt (internal manifests, normalized through dataset_builder.segment_text) |
| Epochs | 10 |
| Batch size | 2 (gradient accumulation 1) |
| Optimizer | AdamW (managed by Seq2SeqTrainer) |
| Learning rate | 1e-6 with cosine scheduler & 1k warmup steps |
| Normalization | Khmer-specific regex and rule-based normalization (khmerspeech, khmercut) |
| Dataset | Mixed Khmer & English audio, ~199K samples (~225 hours): all public Khmer datasets plus a human-labeled internal dataset |
| Training time | ~1 day with mixed precision on a single RTX 5090 (32 GB VRAM) |
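The hyperparameters in the table translate roughly into the `Seq2SeqTrainingArguments` below. This is a hedged reconstruction, not the actual training script; `output_dir` and any field not listed in the table are placeholders.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-khmer",   # placeholder path
    num_train_epochs=10,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    learning_rate=1e-6,                 # AdamW is the Trainer default
    lr_scheduler_type="cosine",
    warmup_steps=1000,
    fp16=True,                          # mixed precision, per the table
    predict_with_generate=True,         # assumption: generate during eval
)
```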
## Limitations
Performance has been validated only on internal validation/test splits. Long-form audio, accents outside the training distribution, or noisy backgrounds may degrade accuracy.
## Inference Examples
```python
import torch
import torchaudio
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

AUDIO_PATH = "audio_path.wav"

# Prefer GPU with half precision when available.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "metythorn/whisper-small"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    task="automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

speech_waveform, sr = torchaudio.load(AUDIO_PATH)

# Whisper expects 16 kHz mono audio.
if sr != 16000:
    speech_waveform = torchaudio.functional.resample(
        speech_waveform, orig_freq=sr, new_freq=16000
    )
if speech_waveform.dim() > 1 and speech_waveform.size(0) > 1:
    speech_waveform = speech_waveform.mean(dim=0, keepdim=True)  # downmix to mono
speech_waveform = speech_waveform.squeeze().numpy()

result = pipe(speech_waveform)
print("Transcription:", result["text"])
```
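For long-form audio (beyond the 30 s window noted above), the pipeline can chunk the input itself, and pinning the decoding language can help on short or ambiguous clips. A minimal sketch; the chunk length and the `generate_kwargs` values are assumptions, not settings from the training script:

```python
# Chunked long-form decoding with the target language pinned to Khmer.
# chunk_length_s=30 matches Whisper's context window; adjust to taste.
result = pipe(
    speech_waveform,
    chunk_length_s=30,
    return_timestamps=True,
    generate_kwargs={"language": "khmer", "task": "transcribe"},
)
print("Transcription:", result["text"])
```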