This is a character-level segmentation model for Western Armenian, based on COMBO-SEG, an open-source text segmentation system. It segments raw text into sentences and tokens, and splits multi-word tokens into words.
The Western_Armenian model uses FacebookAI/xlm-roberta-base as its base encoder and is trained on UD_Western_Armenian-ArmTDP (UD v2.17).
| Metric | Tokens | Words | Sentences |
|---|---|---|---|
| F1 | 99.94 | 97.62 | 99.78 |
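To make the F1 figures above concrete, here is a minimal sketch of how a span-based F1 score can be computed for segmentation output. It assumes a predicted unit counts as correct only when its character span exactly matches a gold span; the actual evaluation script behind the table is not specified here, and the function name `span_f1` and the offsets are hypothetical.

```python
def span_f1(gold, pred):
    """Return (precision, recall, F1) for two collections of (start, end) spans."""
    gold_set, pred_set = set(gold), set(pred)
    correct = len(gold_set & pred_set)  # spans with identical boundaries
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical character offsets: the prediction merges two gold tokens.
gold = [(0, 4), (5, 9), (10, 15), (15, 16)]
pred = [(0, 4), (5, 15), (15, 16)]

p, r, f1 = span_f1(gold, pred)
print(f"P={p:.2%} R={r:.2%} F1={f1:.2%}")
```

A single merged token costs both precision and recall, which is why segmentation F1 is sensitive to even small boundary errors.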
Install the library from PyPI:
```bash
pip install combo-seg
```
```python
from combo_seg import ComboSeg

# Load a pre-trained model
nlp = ComboSeg("Western_Armenian")

# Segment raw text; returns a Document with hierarchy: Document -> Turn -> Sentence -> Token
doc = nlp("Արագ շագանակագոյն աղուէսը ծոյլ շան վրայէն կը ցատկէ։")

# Inspect results
for turn in doc.turns:
    for sentence in turn.sentences:
        print(f"Sentence: {sentence.text}")
        for token in sentence.tokens:
            if token.is_multi_word:
                # Multi-word tokens expose their component words
                print(f"  MWT: {token.text} -> {token.subwords}")
            else:
                print(f"  Token: {token.text}")
```
Or load directly from HuggingFace:
```python
from combo_seg import ComboSeg

nlp = ComboSeg.from_pretrained("clarin-pl/combo-seg-xlm-roberta-base-western-armenian-armtdp-ud2.17")
doc = nlp("Արագ շագանակագոյն աղուէսը ծոյլ շան վրայէն կը ցատկէ։")
```
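Whichever way the model is loaded, downstream tools often want a flat list of words rather than the nested hierarchy. The sketch below relies only on the attributes used in the usage example (`turns`, `sentences`, `tokens`, `is_multi_word`, `subwords`, `text`). The helper `iter_words` is hypothetical and not part of COMBO-SEG, and it assumes that `subwords` holds either strings or objects with a `.text` attribute. The stand-in `demo` document is built by hand so the snippet runs without downloading a model.

```python
from types import SimpleNamespace


def iter_words(doc):
    """Yield words in document order, expanding multi-word tokens."""
    for turn in doc.turns:
        for sentence in turn.sentences:
            for token in sentence.tokens:
                if token.is_multi_word:
                    for sub in token.subwords:
                        # Assumption: a subword is a string or has a .text attribute
                        yield getattr(sub, "text", sub)
                else:
                    yield token.text


def _tok(text, subwords=None):
    """Build a hand-made stand-in for a token (illustration only)."""
    return SimpleNamespace(
        text=text,
        is_multi_word=subwords is not None,
        subwords=subwords or [],
    )


# Stand-in document mirroring Document -> Turn -> Sentence -> Token
demo = SimpleNamespace(turns=[SimpleNamespace(sentences=[
    SimpleNamespace(text="ab c.", tokens=[_tok("ab", ["a", "b"]), _tok("c"), _tok(".")]),
])])

words = list(iter_words(demo))
print(words)
```

With a real model, `list(iter_words(doc))` should work on the `doc` returned by `nlp(...)`, provided the assumption about `subwords` holds.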
The training data is derived from the Universal Dependencies treebank and is licensed under cc-by-sa-4.0. For the full license terms of each treebank, please refer to the LICENSE.txt file in the corresponding treebank repository.
If you use this model, please cite:
Ulewicz, M., & Wróblewska, A. (2026). COMBO-SEG Models Trained on UD v2.17. https://doi.org/10.5281/zenodo.19651441
```bibtex
@software{combo_seg_2026,
  author    = {Ulewicz, Michał and Wróblewska, Alina},
  title     = {{COMBO-SEG} Models Trained on {UD} v2.17},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19651441},
  url       = {https://doi.org/10.5281/zenodo.19651441}
}
```