Zeineuski — Basque Dialect Identification

Fine-grained dialect identification (DID) system for Basque (Euskara). Given a text or speech sample, classifies it into one of six dialect categories: Western (Bizkaiera), Central (Gipuzkera), Navarrese, Navarrese-Labourdin, Souletin (Zuberera), or Standard Basque (Batua).

Source code: github.com/itzune/zeineuski

Architecture

Zeineuski uses a three-tier hierarchical classification architecture:

Tier 1: batua / dialectal (binary)
  └─ Tier 2: 5-class euskalkia (dialect classification)
       └─ Tier 3: 12-class azpieuskalkia (sub-dialect classification)
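
The cascade above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's code: the `classify` function, the `Predictor` type, and the binary label string `batua` are hypothetical names, and each tier is passed in as a plain callable so the routing logic does not depend on how the models are loaded.

```python
from typing import Callable, Optional, Tuple

# A predictor maps text to (label, probability). In practice it could wrap a
# fastText model; here it is just a callable so the cascade runs as-is.
Predictor = Callable[[str], Tuple[str, float]]


def classify(
    text: str,
    tier1: Predictor,
    tier2: Predictor,
    tier3: Optional[Predictor] = None,
) -> dict:
    """Route text through the tiers: batua/dialectal, euskalki, azpieuskalki."""
    label, prob = tier1(text)
    result = {"tier1": (label, prob)}
    # Assumed label name: the binary model is taken to emit "batua" for
    # standard Basque. Anything else continues down the hierarchy.
    if label == "batua":
        return result
    result["tier2"] = tier2(text)
    if tier3 is not None:
        result["tier3"] = tier3(text)
    return result


# Stub predictors standing in for real models (illustrative values only).
demo = classify(
    "Neská jin düzü",
    tier1=lambda t: ("dialectal", 0.97),
    tier2=lambda t: ("souletin", 0.95),
    tier3=lambda t: ("zuberera", 0.98),
)
print(demo)
```

Stopping at Tier 1 for batua input means the dialect models are only consulted for text already judged dialectal.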

Classification taxonomy

The project follows Koldo Zuazo's dialect classification, which is the current linguistic consensus and the basis for Ahotsak.eus's municipality→dialect mapping.

Zuazo recognizes 6 euskalkiak (dialects):

| # | Euskalkia | Our label | Notes |
|---|---|---|---|
| 1 | Bizkaiera / Mendebalekoa | western | |
| 2 | Gipuzkera / Erdialdekoa | central | |
| 3 | Goi-nafarrera | navarrese | Upper Navarrese |
| 4 | Ekialdeko nafarrera / Erronkariera | (merged into navarrese) | Extinct ~1990s; tiny data |
| 5 | Zuberera | souletin | |
| 6 | Nafar-lapurtera | nav-lab | |
| + | Euskara batua | batua | Standard unified Basque |

Why 5 euskalkis + batua instead of 6 + batua?

Ekialdeko nafarrera (Salazarese/Roncalese) is linguistically a distinct dialect, but it has been functionally extinct since the 1990s (last native speaker died in 1991). Ahotsak.eus has only ~65 passages across 7 towns in the Zaraitzu and Erronkari valleys. The Klasikoak.armiarma.eus classical literature corpus — which provides most of our Tier-2 training data — maps these texts to navarrese since the dialect distinction is not present in pre-20th-century literary sources.

For Tier 3 (azpieuskalkia), we follow the Zuazo azpieuskalki taxonomy as implemented on Ahotsak.eus. The official Ahotsak municipality→azpieuskalki mapping provides the ground-truth labels for sub-dialect classification.
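
As a rough sketch of how such a mapping can become training labels, the snippet below turns (municipality, text) pairs into fastText's `__label__<class> <text>` line format. The `TOWN_TO_AZPIEUSKALKI` table and its `town-a`/`town-b` keys are placeholders, not the real Ahotsak mapping, and `to_fasttext_lines` is a hypothetical helper.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical excerpt of a municipality -> azpieuskalki table. The real
# mapping comes from Ahotsak.eus; these two rows are placeholders.
TOWN_TO_AZPIEUSKALKI: Dict[str, str] = {
    "town-a": "zuberera",
    "town-b": "ekialde-nafarra",
}


def to_fasttext_lines(
    passages: Iterable[Tuple[str, str]],
    mapping: Dict[str, str],
) -> List[str]:
    """Turn (municipality, text) pairs into '__label__<class> <text>' lines."""
    lines = []
    for town, text in passages:
        label = mapping.get(town)
        if label is None:
            continue  # skip municipalities with no known sub-dialect
        # fastText reads one example per line, so collapse internal whitespace.
        clean = " ".join(text.split())
        if clean:
            lines.append(f"__label__{label} {clean}")
    return lines


sample = [("town-a", "lehen  lerroa\nbigarrena"), ("town-x", "ez dago mapan")]
print(to_fasttext_lines(sample, TOWN_TO_AZPIEUSKALKI))
```

Passages from unmapped municipalities are dropped rather than given a guessed label, which keeps the ground truth tied to the mapping.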

Data Sources

| Source | Content | Dialects | Status |
|---|---|---|---|
| Klasikoak | Literary texts (pre-20th c.) | 5 euskalkis | Train |
| Ahotsak.eus | Oral history transcriptions | 12 azpieuskalkis | Train + Test |
| SÜ AZIA | Pastoral scripts + blog articles | Zuberera | Train + Test |

See docs/data_sources/suazia_zuberotarra.md for the SÜ AZIA corpus documentation.

Models

Euskalki (Dialect) Classification — 5 euskalkis + batua (6-class)

Hierarchical 2-step classifier (binary batua/dialectal → 5-class euskalkiak):

| Variant | Filename | Size | XNLI (3-class) | Test (4-class) | Batua F1 |
|---|---|---|---|---|---|
| final | hier_binary_final.bin + hier_dialect_final.bin | 1.5GB | 92.42% | 95.18% | 0.962 |
| quantized | hier_*_quantized.bin | 417MB | 92.38% | 95.16% | 0.961 |
| compact | hier_*_compact.bin | 189MB | 91.78% | 94.71% | 0.957 |
| tiny | hier_*_tiny.bin | 112MB | 91.90% | 94.88% | 0.961 |
| web | hier_binary_web.bin + hier_dialect_web.bin | 32MB | 91.06% | 94.33% | 0.952 |

Per-class F1 (final): Western 0.953, Central 0.933, Nav-Lab 0.949, Batua 0.962.

Azpieuskalki (Sub-Dialect) Classification — 12-class (v2, 2026-06-11)

Fine-grained sub-dialect classifier trained on Ahotsak.eus oral history transcriptions and augmented with the SÜ AZIA Zuberotarra corpus (6,676 pastoral + blog sentences).

Training data (42,229 sentences):

| Azpieuskalki | Sentences | % | Source |
|---|---|---|---|
| mendebal-sortaldea | 13,059 | 30.9% | Ahotsak |
| erdialde-sartaldea | 9,804 | 23.2% | Ahotsak |
| zuberera | 6,050 | 14.3% | Ahotsak (441) + SÜ AZIA (6,676); source counts are before the train/test split (6,050 train + 1,067 test = 7,117) |
| erdialde-sortaldea | 4,966 | 11.8% | Ahotsak |
| nafar-ipar-sartaldea | 1,966 | 4.7% | Ahotsak |
| nafar-sortaldea | 1,516 | 3.6% | Ahotsak |
| naflap-sortaldea | 1,395 | 3.3% | Ahotsak |
| nafar-hego-sartaldea | 1,101 | 2.6% | Ahotsak |
| naflap-sartaldea | 726 | 1.7% | Ahotsak |
| ekialde-nafarra | 710 | 1.7% | Ahotsak |
| nafar-erdigunea | 497 | 1.2% | Ahotsak |
| mendebal-sartaldea | 439 | 1.0% | Ahotsak |

Results (84.06% overall on 7,445 test samples):

| Azpieuskalki | Test samples | Accuracy |
|---|---|---|
| zuberera | 1,067 | 94.19% |
| mendebal-sortaldea | 2,304 | 90.58% |
| nafar-ipar-sartaldea | 346 | 88.15% |
| erdialde-sartaldea | 1,729 | 83.40% |
| erdialde-sortaldea | 876 | 79.11% |
| nafar-sortaldea | 267 | 75.66% |
| naflap-sortaldea | 246 | 71.95% |
| naflap-sartaldea | 127 | 69.29% |
| ekialde-nafarra | 125 | 68.00% |
| nafar-erdigunea | 87 | 49.43% |
| nafar-hego-sartaldea | 194 | 48.45% |
| mendebal-sartaldea | 77 | 48.05% |
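
The overall figure is the sample-weighted mean of the per-class accuracies, so it is dominated by the largest classes. The short check below recomputes it from the table and also reports the unweighted (macro) mean, which treats every azpieuskalki equally. Variable names are illustrative; the numbers are copied from the table above.

```python
# (test samples, accuracy %) per azpieuskalki, copied from the results table.
results = {
    "zuberera": (1067, 94.19),
    "mendebal-sortaldea": (2304, 90.58),
    "nafar-ipar-sartaldea": (346, 88.15),
    "erdialde-sartaldea": (1729, 83.40),
    "erdialde-sortaldea": (876, 79.11),
    "nafar-sortaldea": (267, 75.66),
    "naflap-sortaldea": (246, 71.95),
    "naflap-sartaldea": (127, 69.29),
    "ekialde-nafarra": (125, 68.00),
    "nafar-erdigunea": (87, 49.43),
    "nafar-hego-sartaldea": (194, 48.45),
    "mendebal-sartaldea": (77, 48.05),
}

total = sum(n for n, _ in results.values())
# Overall accuracy: each class contributes in proportion to its test size.
weighted = sum(n * acc for n, acc in results.values()) / total
# Macro average: every class counts equally, regardless of size.
macro = sum(acc for _, acc in results.values()) / len(results)

print(total, round(weighted, 2), round(macro, 2))
```

The weighted mean reproduces the reported 84.06% on 7,445 samples, while the macro mean comes out near 72.2%, reflecting the three smallest Navarrese and Western sub-dialects that sit below 50%.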

Model variants:

| Variant | Filename | Accuracy | Size | vs original |
|---|---|---|---|---|
| original | azpieuskalki.bin | 84.06% | 243MB | baseline |
| quantized | azpieuskalki_q.bin | 82.15% | 22MB | -1.91pp, 11× smaller |
| bucket=50K | azpieuskalki_b50000.bin | 83.47% | 129MB | -0.59pp, 1.9× smaller |
| bucket=50K Q | azpieuskalki_b50000_q.bin | 81.96% | 5.5MB | -2.10pp, 44× smaller |
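
This README does not say how the quantized variants were produced. One plausible route, sketched here as an assumption, is fastText's built-in quantization through the Python `quantize` method. The `quantize_model` and `size_reduction` helpers and the `cutoff=100000` value are hypothetical; the sizes in the loop are taken from the table above.

```python
def quantize_model(src_path: str, dst_path: str, train_path: str) -> None:
    """Quantize a supervised fastText model (assumed workflow, not confirmed)."""
    import fasttext  # imported lazily so the helper below runs without it

    model = fasttext.load_model(src_path)
    # cutoff prunes the vocabulary to the most useful words/n-grams (the value
    # here is arbitrary); retrain=True fine-tunes afterwards and needs the
    # original training file.
    model.quantize(input=train_path, retrain=True, cutoff=100000)
    model.save_model(dst_path)


def size_reduction(original_mb: float, variant_mb: float) -> float:
    """How many times smaller the variant is than the original."""
    return original_mb / variant_mb


# Sizes in MB from the model variants table; 243 MB is the original model.
for name, mb in [("quantized", 22), ("bucket=50K", 129), ("bucket=50K Q", 5.5)]:
    print(f"{name}: {size_reduction(243, mb):.1f}x smaller")
```

The bucket=50K variants would instead come from retraining with a smaller hash table (fastText's `-bucket` option), which shrinks the n-gram embedding matrix before any quantization is applied.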

Usage

import fasttext

# Load a model
model = fasttext.load_model("azpieuskalki.bin")

# Predict
text = "Neská jin düzü, Zuñ néska?"
labels, probs = model.predict(text, k=3)
print(labels[0].replace("__label__", ""), probs[0])
# Output: zuberera 0.978
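
fastText's `predict` handles one line at a time and raises a `ValueError` if the string contains a newline, so multi-line input needs flattening first. The wrapper below is a small sketch of that: `top_k` and `StubModel` are hypothetical names, and the stub only mimics the shape of `predict` so the example runs without a model file.

```python
from typing import List, Tuple


def top_k(model, text: str, k: int = 3) -> List[Tuple[str, float]]:
    """Return up to k (label, probability) pairs for a single text."""
    # Collapse newlines and repeated whitespace into single spaces, since
    # predict() rejects strings that contain "\n".
    line = " ".join(text.split())
    labels, probs = model.predict(line, k=k)
    return [(l.replace("__label__", ""), float(p)) for l, p in zip(labels, probs)]


class StubModel:
    """Stand-in with the same predict() shape as a fastText model."""

    def predict(self, text, k=1):
        assert "\n" not in text
        labels = ("__label__zuberera", "__label__naflap-sortaldea")
        return labels[:k], (0.9, 0.1)[:k]


print(top_k(StubModel(), "Neská jin düzü,\nZuñ néska?", k=2))
```

With a real model, pass the object returned by `fasttext.load_model(...)` in place of `StubModel()`.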

Or use the zeineuski CLI from the source repo:

uv run zeineuski predict --text "Gaur goizean goiz jaiki naiz"

Web Demo

Try it in your browser — no server, no install:

itzune.eus/euskalkid (source)

34MB of fastText models running via WebAssembly. Works offline after first load.

Training

Best hyperparameters found (via pi-autoresearch, 37 experiments over 3 sessions):

Azpieuskalki 12-class:

fasttext supervised -input train_azpieuskalki.txt -output azpieuskalki \
  -dim 200 -lr 0.2 -epoch 75 -wordNgrams 2 -minn 2 -maxn 6 -loss ns

Key insight: do not use autotune; its aggressive learning-rate decay overfits to the dominant classes. Character n-grams (minn=2, maxn=6) capture dialect-specific Basque morphology such as case endings and verb suffixes (+9.4pp improvement).
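
To make the n-gram point concrete, here is a simplified sketch of the subword units that minn=2 and maxn=6 produce for one word. It mirrors fastText's scheme of wrapping the word in `<` and `>` boundary markers, but skips the hashing into buckets that fastText applies afterwards; `char_ngrams` is a hypothetical helper, not part of the fastText API.

```python
from typing import List


def char_ngrams(word: str, minn: int = 2, maxn: int = 6) -> List[str]:
    """Character n-grams of '<word>' for n in [minn, maxn], fastText-style."""
    wrapped = f"<{word}>"  # boundary markers distinguish prefixes from suffixes
    return [
        wrapped[i : i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(wrapped) - n + 1)
    ]


grams = char_ngrams("etxean")  # "etxean": "in the house" (etxe + inessive -an)
print(len(grams), [g for g in grams if g.endswith(">")])
```

Units such as `an>` and `ean>` are anchored to the end of the word, so a case ending gets its own feature that is shared across every noun carrying it.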

Changelog

v2 (2026-06-11)

  • Added 6,676 SÜ AZIA Zuberotarra sentences (pastoral scripts + blog articles)
  • Zuberera data: 750 → 7,117 sentences (1.9% → 14.3%), split into 6,050 train + 1,067 test
  • Zuberera per-class accuracy: 94.19%
  • Overall 12-class accuracy: 84.06% (from 82.08% in v1)
  • New model variants: q (22MB), b50000 (129MB), b50000_q (5.5MB)

v1 (2026-06-08)

  • Initial 12-class azpieuskalki model: 82.08% accuracy
  • 9-class variant (min_samples=600): 83.55%

License

MIT
