COMBO-SEG Model for Western Armenian

Model Description

This is a character-level segmentation model for Western Armenian, based on COMBO-SEG, an open-source text segmentation system. It performs:

  • sentence segmentation
  • tokenisation (including multi-word token detection)
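
To make "character-level segmentation" concrete, the sketch below decodes one label per character into sentences and tokens. The label set (I, T, S, O) and the `decode` function are hypothetical and chosen only for illustration; the label scheme COMBO-SEG actually uses is not described here, and multi-word token detection is left out.

```python
from typing import List

# Hypothetical label scheme (an assumption, not COMBO-SEG's documented one):
#   "I" = character inside a token
#   "T" = last character of a token
#   "S" = last character of a token that also ends a sentence
#   "O" = character outside any token (e.g. whitespace)


def decode(text: str, labels: List[str]) -> List[List[str]]:
    """Turn per-character labels into a list of sentences, each a list of tokens."""
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")

    sentences: List[List[str]] = []
    current_sentence: List[str] = []
    current_token = ""

    for ch, label in zip(text, labels):
        if label == "O":
            continue  # skip characters that belong to no token
        current_token += ch
        if label in ("T", "S"):
            current_sentence.append(current_token)
            current_token = ""
        if label == "S":
            sentences.append(current_sentence)
            current_sentence = []

    # Flush an unfinished token or sentence at the end of the input.
    if current_token:
        current_sentence.append(current_token)
    if current_sentence:
        sentences.append(current_sentence)
    return sentences


print(decode("Hi there. Bye.", list("ITOIIIITSOIITS")))
```

Here the two sentence-final full stops carry the "S" label, so the input is split into two sentences of three and two tokens.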

The Western_Armenian model uses FacebookAI/xlm-roberta-base as its base encoder and is trained on UD_Western_Armenian-ArmTDP (UD v2.17).

Evaluation

Metric   Tokens   Words   Sentences
F1       99.94    97.62   99.78
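
The card does not say which script produced these scores. As a rough guide to what a segmentation F1 measures, the sketch below assumes the common span-based definition: a predicted token counts as correct only if its start and end character offsets both match a gold token. The helper names `spans` and `f1` are illustrative. The Words column normally also needs alignment of multi-word tokens, which this sketch does not cover.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]


def spans(tokens: List[str]) -> Set[Span]:
    """Map tokens to (start, end) offsets over the whitespace-free text."""
    result: Set[Span] = set()
    pos = 0
    for tok in tokens:
        result.add((pos, pos + len(tok)))
        pos += len(tok)
    return result


def f1(gold: List[str], pred: List[str]) -> float:
    """Span-based F1: harmonic mean of precision and recall over exact span matches."""
    gold_spans, pred_spans = spans(gold), spans(pred)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# The prediction wrongly splits "bc" into "b" and "c".
score = f1(["a", "bc", "d"], ["a", "b", "c", "d"])
print(f"{score:.4f}")
```

In this toy case two of the four predicted spans match two of the three gold spans, giving precision 1/2, recall 2/3 and F1 4/7.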

Usage

Install the library from PyPI:

pip install combo-seg

from combo_seg import ComboSeg

# Load a pre-trained model
nlp = ComboSeg("Western_Armenian")

# Segment raw text — returns Document with hierarchy: Document -> Turn -> Sentence -> Token
doc = nlp("Արագ շագանակագոյն աղուէսը ծոյլ շան վրայէն կը ցատկէ։")

# Inspect results
for turn in doc.turns:
    for sentence in turn.sentences:
        print(f"Sentence: {sentence.text}")
        for token in sentence.tokens:
            if token.is_multi_word:
                print(f"  MWT: {token.text} -> {token.subwords}")
            else:
                print(f"  Token: {token.text}")

Or load directly from HuggingFace:

from combo_seg import ComboSeg

nlp = ComboSeg.from_pretrained("clarin-pl/combo-seg-xlm-roberta-base-western-armenian-armtdp-ud2.17")
doc = nlp("Արագ շագանակագոյն աղուէսը ծոյլ շան վրայէն կը ցատկէ։")
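
For downstream processing it can be handy to flatten the Document -> Turn -> Sentence -> Token hierarchy into plain rows. The sketch below is one way to do that, relying only on the attributes used in the example above (`turns`, `sentences`, `tokens`, `text`, `is_multi_word`, `subwords`). The `flatten` helper is hypothetical and not part of combo_seg, and it assumes `subwords` is an iterable. So that the snippet runs without downloading a model, it is exercised on stand-in objects with an English example instead of a real `ComboSeg` result.

```python
from types import SimpleNamespace
from typing import Any, Dict, List, Optional


def flatten(doc: Any) -> List[Dict[str, Any]]:
    """Flatten a segmented document into one dict per token."""
    rows: List[Dict[str, Any]] = []
    for turn_idx, turn in enumerate(doc.turns):
        for sent_idx, sentence in enumerate(turn.sentences):
            for token in sentence.tokens:
                is_mwt = bool(token.is_multi_word)
                rows.append({
                    "turn": turn_idx,
                    "sentence": sent_idx,
                    "text": token.text,
                    # Assumption: subwords is iterable; empty for ordinary tokens.
                    "subwords": list(token.subwords) if is_mwt else [],
                })
    return rows


def fake_token(text: str, subwords: Optional[List[str]] = None) -> SimpleNamespace:
    """Stand-in for a token object, used only for this demonstration."""
    return SimpleNamespace(text=text, is_multi_word=subwords is not None,
                           subwords=subwords or [])


# Stand-in document with one turn, one sentence and one multi-word token.
sentence = SimpleNamespace(tokens=[fake_token("I"),
                                   fake_token("don't", ["do", "n't"]),
                                   fake_token("know")])
doc = SimpleNamespace(turns=[SimpleNamespace(sentences=[sentence])])

for row in flatten(doc):
    print(row)
```

With a real model, the same function could be called on the `doc` returned by `nlp(...)`, provided the objects expose the attributes listed above.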

License

The license (cc-by-sa-4.0) is derived from the training data, the Universal Dependencies treebank. For the full license terms, refer to the LICENSE.txt file in the treebank repository.

Citation

If you use this model, please cite:

Ulewicz, M., & Wróblewska, A. (2026). COMBO-SEG Models Trained on UD v2.17. https://doi.org/10.5281/zenodo.19651441

@software{combo_seg_2026,
  author    = {Ulewicz, Michał and Wróblewska, Alina},
  title     = {{COMBO-SEG} Models Trained on {UD} v2.17},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19651441},
  url       = {https://doi.org/10.5281/zenodo.19651441}
}
