---
license: openrail
language:
  - da
base_model: Qwen/Qwen3-ASR-1.7B
tags:
  - automatic-speech-recognition
  - danish
  - qwen
  - asr
  - speech-to-text
  - coral
  - podcast
  - streaming
datasets:
  - alexandrainst/coral
library_name: transformers
pipeline_tag: automatic-speech-recognition
metrics:
  - wer
  - cer
model-index:
  - name: milo-asr
    results:
      - task:
          type: automatic-speech-recognition
          name: Speech Recognition
        dataset:
          type: alexandrainst/coral
          name: CoRal v2 Test
          split: test
        metrics:
          - type: wer
            value: 23.24
            name: WER
          - type: cer
            value: 11.17
            name: CER
---

# Milo-ASR: Danish ASR Model

**Milo-ASR** is a Danish speech-to-text model based on [Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B), fine-tuned to understand Danish in both read-aloud speech and conversations/podcasts.

The model is trained on [CoRal v2](https://huggingface.co/datasets/alexandrainst/coral) plus Danish podcast data, so it performs well across domains. Most of the other models compared here do well in only one of the two domains.

## Results

### CoRal v2 (read-aloud speech, 10,370 samples)

| Model | WER | CER |
|-------|-----|-----|
| hviske-v2 (Whisper v2) | **17.40%** | **7.96%** |
| hviske-v3 (Whisper v3) | 21.62% | 9.22% |
| **Milo-ASR** | 23.24% | 11.17% |
| Whisper v3 Turbo | 40.35% | 15.51% |
| Qwen3-ASR base | 46.28% | 19.78% |

### Podcast (conversations, 500 samples)

| Model | WER | CER |
|-------|-----|-----|
| **Milo-ASR** | **21.82%** | **15.64%** |
| hviske-v2 (Whisper v2) | 50.67% | 38.31% |
| Whisper v3 Turbo | 67.03% | 45.98% |
| Qwen3-ASR base | 67.52% | 47.71% |
| hviske-v3 (Whisper v3) | 67.65% | 50.12% |

Among the models compared, Milo-ASR is the only one that handles both domains well. On podcasts, its WER of 21.82% is about **2.3x lower** than that of the next-best model, hviske-v2 (50.67%).
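
For readers unfamiliar with the metrics in the tables, the following is a minimal sketch of how WER and CER are commonly computed: the edit distance between reference and hypothesis, divided by the reference length, at word and character level respectively. This card does not state which tool or text normalization produced the numbers above, so the helper names and the example sentences here are purely illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate; assumes a non-empty reference."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Illustrative sentences, not taken from any evaluation set.
reference = "jeg bor i københavn"
hypothesis = "jeg bor københavn nu"
print(f"WER: {wer(reference, hypothesis):.2%}")  # one deletion + one insertion over 4 words
print(f"CER: {cer(reference, hypothesis):.2%}")
```

In practice, scores depend heavily on normalization (casing, punctuation, number formatting), so results are only comparable when all models are scored with the same pipeline.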

### Plots

![CoRal v2 WER](plots/wer_comparison.png)
![Podcast WER](plots/wer_comparison_podcast.png)
![CoRal vs Podcast](plots/dual_domain_wer.png)
![Speed](plots/accuracy_vs_speed.png)

---

## Quick Start

```bash
pip install qwen-asr transformers torch
```

```python
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "pluttodk/Milo-ASR",
    dtype="bfloat16",
    device_map="cuda:0",
)

results = model.transcribe(
    audio="path/to/danish_audio.wav",
    language="Danish",
)

print(results[0].text)
```

### Batch Transcription

```python
# Reuses the `model` loaded in the Quick Start example.
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = model.transcribe(audio=audio_files, language="Danish")

for r in results:
    print(r.text)
```

### Timestamps

```python
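from qwen_asr import Qwen3ASRModel

# Load the ASR model together with a forced aligner, which produces the timestamps.
model = Qwen3ASRModel.from_pretrained(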
    "pluttodk/Milo-ASR",
    forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
    dtype="bfloat16",
    device_map="cuda:0",
)

results = model.transcribe(
    audio="path/to/audio.wav",
    language="Danish",
    return_time_stamps=True,
)

# Print each aligned item with its start and end time in seconds.
for item in results[0].time_stamps.items:
    print(f"{item.start_time:.2f}s - {item.end_time:.2f}s: {item.text}")
```
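
One common use of timestamps is subtitle generation. The sketch below groups items into SRT cues, assuming each item exposes `start_time`, `end_time`, and `text` as in the loop above and that items are word-level. The helper names `srt_time` and `to_srt` and the `words_per_cue` parameter are hypothetical and not part of `qwen_asr`.

```python
from types import SimpleNamespace


def srt_time(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(items, words_per_cue=8):
    """Group timestamped items into numbered SRT cues."""
    items = list(items)
    cues = []
    for n, i in enumerate(range(0, len(items), words_per_cue), start=1):
        group = items[i:i + words_per_cue]
        text = " ".join(item.text for item in group)
        span = f"{srt_time(group[0].start_time)} --> {srt_time(group[-1].end_time)}"
        cues.append(f"{n}\n{span}\n{text}\n")
    return "\n".join(cues)


# Stand-in items for illustration; in practice pass results[0].time_stamps.items.
demo = [
    SimpleNamespace(start_time=0.00, end_time=0.40, text="Hej"),
    SimpleNamespace(start_time=0.45, end_time=0.90, text="med"),
    SimpleNamespace(start_time=0.95, end_time=1.50, text="dig"),
]
print(to_srt(demo, words_per_cue=2))
```

A fixed word count per cue is the simplest grouping; splitting on pauses or punctuation would likely read better.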

### Streaming (vLLM)

```python
from qwen_asr import Qwen3ASRModel

# Load the model through the vLLM backend.
model = Qwen3ASRModel.LLM(
    model="pluttodk/Milo-ASR",
    gpu_memory_utilization=0.8,
)

state = model.init_streaming_state(language="Danish", chunk_size_sec=2.0)

# `audio_stream()` is a placeholder for your own generator of audio chunks.
for audio_chunk in audio_stream():
    state = model.streaming_transcribe(audio_chunk, state)
    print(state.text)

state = model.finish_streaming_transcribe(state)
```
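
The `audio_stream()` generator in the snippet above is not defined in this card. As a rough stand-in for testing, the sketch below reads a 16-bit mono WAV file with the standard library and yields it in fixed-length chunks. The name `wav_chunks` is hypothetical, and the input format that `streaming_transcribe` expects (sample rate, dtype, array type) is not specified here, so treat the float-list output as an assumption and convert as the `qwen-asr` documentation requires.

```python
import os
import sys
import tempfile
import wave
from array import array


def wav_chunks(path, chunk_size_sec=2.0):
    """Yield consecutive chunks of a 16-bit mono WAV as lists of floats in [-1, 1)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM")
        frames_per_chunk = int(wav.getframerate() * chunk_size_sec)
        while True:
            raw = wav.readframes(frames_per_chunk)
            if not raw:
                break
            samples = array("h", raw)  # WAV PCM is little-endian signed 16-bit
            if sys.byteorder == "big":
                samples.byteswap()
            yield [s / 32768.0 for s in samples]


# Demo: write 5 s of silence at 16 kHz to a temporary file, then chunk it.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "silence.wav")
    with wave.open(demo_path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(16_000)
        out.writeframes(bytes(2 * 16_000 * 5))
    chunk_lengths = [len(chunk) for chunk in wav_chunks(demo_path)]

print(chunk_lengths)  # [32000, 32000, 16000]
```

With such a helper, `audio_stream` could be defined as a thin wrapper, for example `lambda: wav_chunks("path/to/audio.wav")`. A live microphone source would replace the file reader but keep the same chunking idea.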

---

## Training Details

The model was fine-tuned in two stages:

1. **Stage 1**: Qwen3-ASR-1.7B fine-tuned on CoRal v2 (~250K samples, 3 epochs, lr=2e-5)
2. **Stage 2**: Continued from the stage 1 checkpoint on podcast + Azure podcast data (~141K samples, 8 epochs, lr=1e-5, cosine schedule)

| Parameter | Stage 2 |
|-----------|---------|
| Learning rate | 1e-5 |
| Batch size | 8 (x4 gradient accumulation = 32 effective) |
| Epochs | 8 |
| LR scheduler | Cosine |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Precision | bfloat16 |
| Training steps | 35,560 |
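
To make the schedule concrete, here is a small sketch of the learning rate over the stage 2 run using the values from the table. It assumes linear warmup followed by cosine decay to zero, which is how cosine schedules with warmup are commonly implemented; the card does not confirm the exact form used. It also checks the step count: about 141K samples for 8 epochs at an effective batch size of 32 gives roughly 35,250 steps, in line with the reported 35,560 given that the sample count is rounded.

```python
import math

BASE_LR = 1e-5
TOTAL_STEPS = 35_560
WARMUP_STEPS = round(0.1 * TOTAL_STEPS)  # warmup ratio 0.1 -> 3,556 steps


def lr_at(step):
    """Learning rate at an optimizer step (assumed: linear warmup, cosine decay to 0)."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


# Sanity check on the step count from the rounded figures in this card.
effective_batch = 8 * 4                           # batch size x gradient accumulation
approx_steps = 141_000 * 8 // effective_batch     # samples x epochs / effective batch

print(f"warmup steps:    {WARMUP_STEPS}")
print(f"estimated steps: {approx_steps}")
for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```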

---

## Features Inherited from Qwen3-ASR

- Streaming/real-time transcription via vLLM
- Song detection (background music)
- Word-level timestamps
- 30+ languages (optimized for Danish)
- Up to 20 minutes of audio per request

---

## Citation

```bibtex
@misc{Milo-ASR,
  author = {Rønnelund, Mathias Oliver Valdbjørn},
  title = {Milo-ASR: Danish ASR Model based on Qwen3-ASR},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/pluttodk/Milo-ASR}
}
```

## Acknowledgements

- [Qwen Team](https://github.com/QwenLM) for Qwen3-ASR
- [Alexandra Institute](https://alexandra.dk/) for CoRal v2