COMBO-SEG Model for Western Armenian

Model Description

This is a character-level segmentation model for Western Armenian, based on COMBO-SEG, an open-source text segmentation system. It performs:

sentence segmentation
tokenisation (including multi-word token detection)

The model uses FacebookAI/xlm-roberta-base as its base encoder and is trained on the UD_Western_Armenian-ArmTDP treebank (UD v2.17).

Evaluation

Metric	Tokens	Words	Sentences
F1	99.94	97.62	99.78
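
For readers unfamiliar with segmentation scores, the sketch below shows one common way an F1 value is obtained for tokens or sentences: each unit is treated as a character span, and a predicted span counts as correct only if it matches a gold span exactly. This is an assumption about the metric, modelled on the usual Universal Dependencies evaluation practice; the model card does not state which evaluation script produced the numbers above. The function name span_f1 and the example spans are hypothetical.

```python
def span_f1(gold_spans, pred_spans):
    """Precision, recall and F1 over exact-match (start, end) character spans."""
    gold = set(gold_spans)
    pred = set(pred_spans)
    correct = len(gold & pred)  # spans with identical boundaries in both sets
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Hypothetical example: gold has three tokens, the system merges the last two.
gold = [(0, 3), (4, 9), (9, 10)]
pred = [(0, 3), (4, 10)]
p, r, f = span_f1(gold, pred)
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")  # P=0.50 R=0.33 F1=0.40
```

The sketch covers only units that have character offsets. Words inside multi-word tokens have no offsets of their own, so scoring them needs an additional alignment step that is not shown here.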

Usage

Install the library from PyPI:

pip install combo-seg

Then load a pre-trained model and segment raw text:

from combo_seg import ComboSeg

# Load a pre-trained model
nlp = ComboSeg("Western_Armenian")

# Segment raw text; returns a Document with the hierarchy Document -> Turn -> Sentence -> Token
doc = nlp("Արագ շագանակագոյն աղուէսը ծոյլ շան վրայէն կը ցատկէ։")

# Inspect results
for turn in doc.turns:
    for sentence in turn.sentences:
        print(f"Sentence: {sentence.text}")
        for token in sentence.tokens:
            if token.is_multi_word:
                print(f"  MWT: {token.text} -> {token.subwords}")
            else:
                print(f"  Token: {token.text}")

Or load directly from HuggingFace:

from combo_seg import ComboSeg

nlp = ComboSeg.from_pretrained("clarin-pl/combo-seg-xlm-roberta-base-western-armenian-armtdp-ud2.17")
doc = nlp("Արագ շագանակագոյն աղուէսը ծոյլ շան վրայէն կը ցատկէ։")
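
Downstream tools often need one flat list of syntactic words per sentence rather than the nested hierarchy. The sketch below walks the same attributes used in the inspection loop above (turns, sentences, tokens, is_multi_word, subwords, text) and expands multi-word tokens into their parts. It assumes that subwords is a list of strings, which the model card does not confirm. The dataclasses are hypothetical stand-ins, not the library's own classes; they are there only so the example runs without downloading a model, and the demo text is a placeholder rather than real Western Armenian.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical stand-ins mirroring only the attributes used in the Usage
# example. They are NOT the combo_seg classes.
@dataclass
class Token:
    text: str
    subwords: List[str] = field(default_factory=list)

    @property
    def is_multi_word(self) -> bool:
        return bool(self.subwords)


@dataclass
class Sentence:
    text: str
    tokens: List[Token]


@dataclass
class Turn:
    sentences: List[Sentence]


@dataclass
class Document:
    turns: List[Turn]


def sentence_words(sentence):
    """Return the words of one sentence, expanding multi-word tokens."""
    words = []
    for token in sentence.tokens:
        if token.is_multi_word:
            # Assumption: subwords is a list of strings.
            words.extend(token.subwords)
        else:
            words.append(token.text)
    return words


def document_words(doc):
    """Flatten Document -> Turn -> Sentence into one word list per sentence."""
    return [sentence_words(s) for turn in doc.turns for s in turn.sentences]


# Placeholder data: "ab" plays the role of a multi-word token.
demo = Document(turns=[Turn(sentences=[
    Sentence(text="ab c.", tokens=[Token("ab", ["a", "b"]), Token("c"), Token(".")]),
])])
print(document_words(demo))  # [['a', 'b', 'c', '.']]
```

With a real model, the same two functions should work on the object returned by nlp(...), provided the attribute names and types match the assumptions stated above.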

License

The training data license, cc-by-sa-4.0, is the license of the Universal Dependencies treebank used for training. For the full license terms, please refer to the LICENSE.txt file in the treebank repository:

UD_Western_Armenian-ArmTDP LICENSE.txt

Citation

If you use this model, please cite:

Ulewicz, M., & Wróblewska, A. (2026). COMBO-SEG Models Trained on UD v2.17. https://doi.org/10.5281/zenodo.19651441

@software{combo_seg_2026,
  author    = {Ulewicz, Michał and Wróblewska, Alina},
  title     = {{COMBO-SEG} Models Trained on {UD} v2.17},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19651441},
  url       = {https://doi.org/10.5281/zenodo.19651441}
}

Resources

COMBO-SEG: https://gitlab.clarin-pl.eu/syntactic-tools/combo-seg
UD_Western_Armenian-ArmTDP: https://github.com/UniversalDependencies/UD_Western_Armenian-ArmTDP
