# Whisper Small Khmer ASR

A fine-tuned variant of `openai/whisper-small` for Khmer automatic speech recognition. The model was trained with the utilities in `whisper` and is intended for transcription workloads that prioritize Khmer text normalization, including numerals, currency, and date expressions.
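
Because the normalization stack (`khmerspeech`, `khmercut`, `dataset_builder.segment_text`) is internal, the following is only a minimal sketch of the kind of rule-based Khmer text normalization described above. The helper name and the specific rules are illustrative assumptions, not the repository's actual implementation.

```python
# Hypothetical sketch of rule-based Khmer text normalization; the model's
# real pipeline (khmerspeech/khmercut) is not reproduced here.
KHMER_TO_ARABIC = str.maketrans("០១២៣៤៥៦៧៨៩", "0123456789")

def normalize_khmer_text(text: str) -> str:
    # Convert Khmer numerals to Arabic numerals, e.g. "១៩៩៩" -> "1999".
    text = text.translate(KHMER_TO_ARABIC)
    # Expand the riel currency sign (៛) into the word "រៀល" ("riel");
    # the convention actually used during training may differ.
    text = text.replace("៛", " រៀល")
    # Collapse runs of whitespace introduced by the substitutions.
    return " ".join(text.split())

print(normalize_khmer_text("តម្លៃ ១០០០៛"))  # -> "តម្លៃ 1000 រៀល"
```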

## Model Card

| Attribute | Value |
| --- | --- |
| Base model | `openai/whisper-small` |
| Language | Khmer (`km-KH`) |
| Task | Automatic speech recognition (speech-to-text) |
| Sample rate | 16 kHz audio, automatically resampled |
| Input length | Up to 30 s clips (truncated during batching) |
| Fine-tuning data | `asr_mixed_dataset.txt` (internal manifests, normalized through `dataset_builder.segment_text`) |
| Epochs | 10 |
| Batch size | 2 (gradient accumulation 1) |
| Optimizer | AdamW (managed by `Seq2SeqTrainer`) |
| Learning rate | 1e-6 with cosine scheduler and 1k warmup steps |
| Normalization | Khmer-specific regex and rule-based normalization (`khmerspeech`, `khmercut`) |
| Dataset | Mixed Khmer and English audio, 199K samples (225 hours): all public Khmer datasets plus a human-labeled dataset |
| Training time | About 1 day of mixed-precision training on an RTX 5090 (32 GB VRAM) |
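
For reference, below is a minimal `Seq2SeqTrainingArguments` sketch that mirrors the hyperparameters in the table. The exact configuration used for this checkpoint is not published, so treat the values (and `output_dir`) as illustrative rather than the authoritative recipe.

```python
from transformers import Seq2SeqTrainingArguments

# Illustrative training arguments mirroring the table above (assumptions,
# not the published recipe).
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-khmer",   # hypothetical output path
    num_train_epochs=10,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    learning_rate=1e-6,
    lr_scheduler_type="cosine",
    warmup_steps=1000,
    fp16=True,                            # mixed precision, per the table
    predict_with_generate=True,
)
```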

**Limitations:** Performance has been validated only on internal validation/test splits. Long-form audio, accents outside the training distribution, or noisy backgrounds may degrade accuracy.

## Inference Examples

```python
import torch
import torchaudio
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

AUDIO_PATH = "audio_path.wav"

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "metythorn/whisper-small"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    task="automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

speech_waveform, sr = torchaudio.load(AUDIO_PATH)

# Whisper expects 16 kHz mono audio.
if sr != 16000:
    speech_waveform = torchaudio.functional.resample(
        speech_waveform,
        orig_freq=sr,
        new_freq=16000,
    )
# Downmix multi-channel audio to mono; squeeze() alone would leave stereo
# input with shape (2, n), which the pipeline does not expect.
speech_waveform = speech_waveform.mean(dim=0).numpy()

result = pipe(speech_waveform)
print("Transcription:", result["text"])
```
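
For recordings longer than Whisper's 30 s window, the pipeline can split the audio into chunks, and pinning the language avoids misdetection on short clips. `chunk_length_s` and `generate_kwargs` are standard `transformers` pipeline arguments, though the exact values here are illustrative.

```python
# Chunked long-form inference with the language pinned to Khmer.
result = pipe(
    speech_waveform,
    chunk_length_s=30,  # split long audio into ~30 s windows
    generate_kwargs={"language": "km", "task": "transcribe"},
)
print("Transcription:", result["text"])
```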