# Whisper Small Khmer ASR

A fine-tuned variant of `openai/whisper-small` for Khmer automatic speech recognition. The model was trained with the utilities in `whisper` and is intended for transcription workloads that prioritize Khmer text normalization, including numerals, currency, and date expressions.
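
Because the normalization stack (`khmerspeech`, `khmercut`, `dataset_builder.segment_text`) is internal, the following is only a minimal sketch of the kind of rule-based Khmer text normalization described above. The helper name and the specific rules are illustrative assumptions, not the repository's actual implementation.

```python
# Hypothetical sketch of rule-based Khmer text normalization; the model's
# real pipeline (khmerspeech/khmercut) is not reproduced here.
KHMER_TO_ARABIC = str.maketrans("០១២៣៤៥៦៧៨៩", "0123456789")

def normalize_khmer_text(text: str) -> str:
    # Convert Khmer numerals to Arabic numerals, e.g. "១៩៩៩" -> "1999".
    text = text.translate(KHMER_TO_ARABIC)
    # Expand the riel currency sign (៛) into the word "រៀល" ("riel");
    # the convention actually used during training may differ.
    text = text.replace("៛", " រៀល")
    # Collapse runs of whitespace introduced by the substitutions.
    return " ".join(text.split())

print(normalize_khmer_text("តម្លៃ ១០០០៛"))  # -> "តម្លៃ 1000 រៀល"
```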

## Model Card

| Attribute | Value |
| --- | --- |
| Base model | `openai/whisper-small` |
| Language | Khmer (`km-KH`) |
| Task | Automatic speech recognition (speech-to-text) |
| Sample rate | 16 kHz audio, automatically resampled |
| Input length | Up to 30 s clips (truncated during batching) |
| Fine-tuning data | `asr_mixed_dataset.txt` (internal manifests, normalized through `dataset_builder.segment_text`) |
| Epochs | 10 |
| Batch size | 2 (gradient accumulation 1) |
| Optimizer | AdamW (managed by `Seq2SeqTrainer`) |
| Learning rate | 1e-6 with cosine scheduler and 1k warmup steps |
| Normalization | Khmer-specific regex and rule-based normalization (`khmerspeech`, `khmercut`) |
| Dataset | Mixed Khmer and English audio, 199K samples (225 hours): all public Khmer datasets plus a human-labeled dataset |
| Training time | About 1 day of mixed-precision training on an RTX 5090 (32 GB VRAM) |
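
For reference, below is a minimal `Seq2SeqTrainingArguments` sketch that mirrors the hyperparameters in the table. The exact configuration used for this checkpoint is not published, so treat the values (and `output_dir`) as illustrative rather than the authoritative recipe.

```python
from transformers import Seq2SeqTrainingArguments

# Illustrative training arguments mirroring the table above (assumptions,
# not the published recipe).
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-khmer",   # hypothetical output path
    num_train_epochs=10,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    learning_rate=1e-6,
    lr_scheduler_type="cosine",
    warmup_steps=1000,
    fp16=True,                            # mixed precision, per the table
    predict_with_generate=True,
)
```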

**Limitations:** Performance has been validated only on internal validation/test splits. Long-form audio, accents outside the training distribution, or noisy backgrounds may degrade accuracy.

## Inference Examples

```python
import torch
import torchaudio
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

AUDIO_PATH = "audio_path.wav"

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "metythorn/whisper-small"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    task="automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

speech_waveform, sr = torchaudio.load(AUDIO_PATH)

# Whisper expects 16 kHz mono audio.
if sr != 16000:
    speech_waveform = torchaudio.functional.resample(
        speech_waveform,
        orig_freq=sr,
        new_freq=16000,
    )
# Downmix multi-channel audio to mono; squeeze() alone would leave stereo
# input with shape (2, n), which the pipeline does not expect.
speech_waveform = speech_waveform.mean(dim=0).numpy()

result = pipe(speech_waveform)
print("Transcription:", result["text"])
```
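
For recordings longer than Whisper's 30 s window, the pipeline can split the audio into chunks, and pinning the language avoids misdetection on short clips. `chunk_length_s` and `generate_kwargs` are standard `transformers` pipeline arguments, though the exact values here are illustrative.

```python
# Chunked long-form inference with the language pinned to Khmer.
result = pipe(
    speech_waveform,
    chunk_length_s=30,  # split long audio into ~30 s windows
    generate_kwargs={"language": "km", "task": "transcribe"},
)
print("Transcription:", result["text"])
```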