Zeineuski — Basque Dialect Identification
Zeineuski is a fine-grained dialect identification (DID) system for Basque (Euskara). Given a text or speech sample, it classifies the sample into one of six categories: five dialects, Western (Bizkaiera), Central (Gipuzkera), Navarrese, Navarrese-Labourdin, and Souletin (Zuberera), plus Standard Basque (Batua).
Source code: github.com/itzune/zeineuski
Architecture
Zeineuski uses a three-tier hierarchical classification architecture:
Tier 1: batua / dialectal (binary)
└─ Tier 2: 5-class euskalkia (dialect classification)
   └─ Tier 3: 12-class azpieuskalkia (sub-dialect classification)
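The first two tiers can be read as a cascade: the dialect classifier only runs when the binary classifier does not pick batua. The following is a minimal sketch of that control flow, not the project's actual implementation. It assumes fastText-style models whose `predict` returns `(labels, probabilities)` and assumes the Tier-1 model emits a `__label__batua` label; `StubModel` and the label names passed to it are hypothetical stand-ins for real models. Tier 3 is left out for brevity.

```python
PREFIX = "__label__"


def strip_label(label):
    """Remove fastText's label prefix, e.g. '__label__batua' -> 'batua'."""
    return label[len(PREFIX):] if label.startswith(PREFIX) else label


def predict_hierarchical(text, binary_model, dialect_model):
    """Tier 1 decides batua vs. dialectal; Tier 2 runs only for dialectal text."""
    # fastText's Python predict() rejects newlines, so flatten the input first.
    line = " ".join(text.split())
    labels, probs = binary_model.predict(line)
    if strip_label(labels[0]) == "batua":  # assumed Tier-1 label name
        return "batua", float(probs[0])
    labels, probs = dialect_model.predict(line)
    return strip_label(labels[0]), float(probs[0])


class StubModel:
    """Hypothetical stand-in that always returns one fixed prediction."""

    def __init__(self, label, prob):
        self.label, self.prob = label, prob

    def predict(self, text, k=1):
        return (PREFIX + self.label,), (self.prob,)


result = predict_hierarchical(
    "Neská jin düzü",
    StubModel("dialectal", 0.97),  # pretend Tier-1 output
    StubModel("souletin", 0.91),   # pretend Tier-2 output
)
print(result)
```

With real models, the two stubs would be replaced by objects loaded with `fasttext.load_model`, which expose the same `predict` method.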
Classification taxonomy
The project follows Koldo Zuazo's dialect classification, which is the current linguistic consensus and the basis for Ahotsak.eus's municipality→dialect mapping.
Zuazo recognizes 6 euskalkiak (dialects):
| # | Euskalkia | Our label | Notes |
|---|---|---|---|
| 1 | Bizkaiera / Mendebalekoa | western | |
| 2 | Gipuzkera / Erdialdekoa | central | |
| 3 | Goi-nafarrera | navarrese | Upper Navarrese |
| 4 | Ekialdeko nafarrera / Erronkariera | (merged into navarrese) | Extinct ~1990s; tiny data |
| 5 | Zuberera | souletin | |
| 6 | Nafar-lapurtera | nav-lab | |
| + | Euskara batua | batua | Standard unified Basque |
Why 5 euskalkis + batua instead of 6 + batua?
Ekialdeko nafarrera (Salazarese/Roncalese) is linguistically a distinct dialect, but
it has been functionally extinct since the 1990s (last native speaker died in 1991).
Ahotsak.eus has only ~65 passages across 7 towns in the Zaraitzu and Erronkari valleys.
The Klasikoak.armiarma.eus classical literature corpus — which provides most of our
Tier-2 training data — maps these texts to navarrese since the dialect distinction
is not present in pre-20th-century literary sources.
For Tier 3 (azpieuskalkia), we follow the Zuazo azpieuskalki taxonomy as implemented on Ahotsak.eus. The official Ahotsak municipality→azpieuskalki mapping provides the ground-truth labels for sub-dialect classification.
Data Sources
| Source | Content | Dialects | Status |
|---|---|---|---|
| Klasikoak | Literary texts (pre-20th c.) | 5 euskalkis | Train |
| Ahotsak.eus | Oral history transcriptions | 12 azpieuskalkis | Train + Test |
| SÜ AZIA | Pastoral scripts + blog articles | Zuberera | Train + Test |
See docs/data_sources/suazia_zuberotarra.md for the SÜ AZIA corpus documentation.
Models
Euskalki (Dialect) Classification — 5 euskalkis + batua (6-class)
Hierarchical 2-step classifier (binary batua/dialectal → 5-class euskalkiak):
| Variant | Filename | Size | XNLI (3-class) | Test (4-class) | Batua F1 |
|---|---|---|---|---|---|
| final | hier_binary_final.bin + hier_dialect_final.bin | 1.5GB | 92.42% | 95.18% | 0.962 |
| quantized | hier_*_quantized.bin | 417MB | 92.38% | 95.16% | 0.961 |
| compact | hier_*_compact.bin | 189MB | 91.78% | 94.71% | 0.957 |
| tiny | hier_*_tiny.bin | 112MB | 91.90% | 94.88% | 0.961 |
| web | hier_binary_web.bin + hier_dialect_web.bin | 32MB | 91.06% | 94.33% | 0.952 |
Per-class F1 (final): Western 0.953, Central 0.933, Nav-Lab 0.949, Batua 0.962.
Azpieuskalki (Sub-Dialect) Classification — 12-class (v2, 2026-06-11)
Fine-grained sub-dialect classifier trained on Ahotsak.eus oral history transcriptions and augmented with the SÜ AZIA Zuberotarra corpus (6,676 pastoral + blog sentences).
Training data (42,229 sentences):
| Azpieuskalki | Sentences | % | Source |
|---|---|---|---|
| mendebal-sortaldea | 13,059 | 30.9% | Ahotsak |
| erdialde-sartaldea | 9,804 | 23.2% | Ahotsak |
| zuberera | 6,050 | 14.3% | Ahotsak (441) + SÜ AZIA (6,676); source counts are before the train/test split |
| erdialde-sortaldea | 4,966 | 11.8% | Ahotsak |
| nafar-ipar-sartaldea | 1,966 | 4.7% | Ahotsak |
| nafar-sortaldea | 1,516 | 3.6% | Ahotsak |
| naflap-sortaldea | 1,395 | 3.3% | Ahotsak |
| nafar-hego-sartaldea | 1,101 | 2.6% | Ahotsak |
| naflap-sartaldea | 726 | 1.7% | Ahotsak |
| ekialde-nafarra | 710 | 1.7% | Ahotsak |
| nafar-erdigunea | 497 | 1.2% | Ahotsak |
| mendebal-sartaldea | 439 | 1.0% | Ahotsak |
Results (84.06% overall on 7,445 test samples):
| Azpieuskalki | Test | Accuracy |
|---|---|---|
| zuberera | 1,067 | 94.19% |
| mendebal-sortaldea | 2,304 | 90.58% |
| nafar-ipar-sartaldea | 346 | 88.15% |
| erdialde-sartaldea | 1,729 | 83.40% |
| erdialde-sortaldea | 876 | 79.11% |
| nafar-sortaldea | 267 | 75.66% |
| naflap-sortaldea | 246 | 71.95% |
| naflap-sartaldea | 127 | 69.29% |
| ekialde-nafarra | 125 | 68.00% |
| nafar-erdigunea | 87 | 49.43% |
| nafar-hego-sartaldea | 194 | 48.45% |
| mendebal-sartaldea | 77 | 48.05% |
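The overall figure is the sample-weighted mean of the per-class accuracies, so large classes dominate it. The sketch below recomputes it from the table as a sanity check; the per-class correct counts are reconstructed by rounding, which is an assumption, and the unweighted (macro) mean is derived here rather than reported by the source.

```python
# (test sentences, per-class accuracy in %) copied from the results table
results = {
    "zuberera": (1067, 94.19),
    "mendebal-sortaldea": (2304, 90.58),
    "nafar-ipar-sartaldea": (346, 88.15),
    "erdialde-sartaldea": (1729, 83.40),
    "erdialde-sortaldea": (876, 79.11),
    "nafar-sortaldea": (267, 75.66),
    "naflap-sortaldea": (246, 71.95),
    "naflap-sartaldea": (127, 69.29),
    "ekialde-nafarra": (125, 68.00),
    "nafar-erdigunea": (87, 49.43),
    "nafar-hego-sartaldea": (194, 48.45),
    "mendebal-sartaldea": (77, 48.05),
}

total = sum(n for n, _ in results.values())
# Reconstruct the number of correct predictions per class (rounded).
correct = sum(round(n * acc / 100) for n, acc in results.values())
overall = 100 * correct / total

# Unweighted mean: every sub-dialect counts equally, regardless of size.
macro = sum(acc for _, acc in results.values()) / len(results)

print(f"test samples: {total}")
print(f"overall (weighted) accuracy: {overall:.2f}%")
print(f"macro (unweighted) accuracy: {macro:.2f}%")
```

The weighted value reproduces the 84.06% headline on 7,445 samples. The macro mean comes out roughly twelve points lower, which reflects how much the three weakest classes (all below 50%) are masked by the large, well-classified ones.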
Model variants:
| Variant | Filename | Accuracy | Size | vs original |
|---|---|---|---|---|
| original | azpieuskalki.bin | 84.06% | 243MB | baseline |
| quantized | azpieuskalki_q.bin | 82.15% | 22MB | -1.91pp, 11× smaller |
| bucket=50K | azpieuskalki_b50000.bin | 83.47% | 129MB | -0.59pp, 1.9× smaller |
| bucket=50K Q | azpieuskalki_b50000_q.bin | 81.96% | 5.5MB | -2.10pp, 44× smaller |
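For readers comparing variants, the "vs original" column is plain arithmetic on the accuracy and size columns. A small sketch that recomputes it from the table values (variable names are illustrative only):

```python
baseline_acc, baseline_mb = 84.06, 243.0

# (accuracy in %, size in MB) copied from the variants table
variants = {
    "quantized": (82.15, 22.0),
    "bucket=50K": (83.47, 129.0),
    "bucket=50K Q": (81.96, 5.5),
}

# Accuracy delta in percentage points and size reduction factor per variant.
summary = {
    name: (round(acc - baseline_acc, 2), round(baseline_mb / mb, 1))
    for name, (acc, mb) in variants.items()
}

for name, (delta_pp, ratio) in summary.items():
    print(f"{name}: {delta_pp:+.2f}pp, {ratio}x smaller")
```

Read this way, the smallest variant gives up about two percentage points of accuracy for a size reduction of roughly 44×.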
Usage
import fasttext
# Load a model
model = fasttext.load_model("azpieuskalki.bin")
# Predict
text = "Neská jin düzü, Zuñ néska?"
labels, probs = model.predict(text, k=3)
print(labels[0].replace("__label__", ""), probs[0])
# Output: zuberera 0.978
Or use the zeineuski CLI from the source repo:
uv run zeineuski predict --text "Gaur goizean goiz jaiki naiz"
Web Demo
Try it in your browser — no server, no install:
34MB of fastText models running via WebAssembly. Works offline after first load.
Training
Optimal hyperparameters (discovered via pi-autoresearch, 37 experiments over 3 sessions):
Azpieuskalki 12-class:
fasttext supervised -input train_azpieuskalki.txt -output azpieuskalki \
-dim 200 -lr 0.2 -epoch 75 -wordNgrams 2 -minn 2 -maxn 6 -loss ns
Key insight: NO autotune — aggressive LR decay overfits to dominant classes. Character n-grams (minn=2,maxn=6) capture Basque morphological patterns (case endings, verb suffixes) that are dialect-specific (+9.4pp improvement).
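To make the character n-gram point concrete, the sketch below lists the subword units that `-minn 2 -maxn 6` would produce for one inflected word. It is a simplified illustration of fastText-style subword extraction: the word is wrapped in `<` and `>` boundary markers and every substring of length 2 to 6 is kept. The real library additionally hashes these n-grams into buckets, which is omitted here, and `char_ngrams` is a hypothetical helper, not part of fastText or this project.

```python
def char_ngrams(word, minn=2, maxn=6):
    """Return fastText-style character n-grams of `word`.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    stay distinguishable from word-internal n-grams.
    """
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(wrapped) - n + 1)
    ]


# "etxean" = etxe (house) + -an (inessive case ending)
grams = char_ngrams("etxean")
suffix_grams = [g for g in grams if g.endswith(">")]

print(len(grams))
print(suffix_grams)
```

The n-grams ending in `>` (such as `an>` and `ean>`) isolate the case ending, which is the kind of morphological cue the paragraph above describes as dialect-specific.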
Changelog
v2 (2026-06-11)
- Added 6,676 SÜ AZIA Zuberotarra sentences (pastoral scripts + blog articles)
- Zuberera data: 750 → 7,117 sentences (share of training data: 1.9% → 14.3%; 6,050 of the 7,117 are in the v2 training split, the rest in the test split)
- Zuberera per-class accuracy: 94.19%
- Overall 12-class accuracy: 84.06% (from 82.08% in v1)
- New model variants: q (22MB), b50000 (129MB), b50000_q (5.5MB)
v1 (2026-06-08)
- Initial 12-class azpieuskalki model: 82.08% accuracy
- 9-class variant (min_samples=600): 83.55%
License
MIT