# LFM2-Khmer-Merged-18K Tokenizer

**Merged Tokenizer for English + Khmer**
This repository contains a merged tokenizer that integrates:

- LiquidAI/LFM2-1.2B (English tokenizer)
- Msok99/18k_tokenizer_v2 (Khmer tokenizer)

The result is a unified tokenizer that handles English, Khmer, numbers, math, and mixed-script text with near-parity tokenization efficiency and full round-trip accuracy.
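A minimal sketch of how a vocabulary-level merge like this can be approximated with the `transformers` API. Note this adds the missing Khmer tokens as *added tokens* rather than rebuilding the underlying BPE merges, so it illustrates the process rather than reproducing the published artifact exactly:

```python
from transformers import AutoTokenizer

# Load the English base and the Khmer tokenizer.
base = AutoTokenizer.from_pretrained("LiquidAI/LFM2-1.2B")
khmer = AutoTokenizer.from_pretrained("Msok99/18k_tokenizer_v2")

# Keep only the Khmer tokens that are not already in the base vocabulary.
base_vocab = set(base.get_vocab())
new_tokens = [tok for tok in khmer.get_vocab() if tok not in base_vocab]

base.add_tokens(new_tokens)   # extends the vocabulary in place
print(len(base))              # final vocabulary size
base.save_pretrained("lfm2-khmer-merged-18k")
```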
## Model Details
| Attribute | Description |
|---|---|
| Base Model | LiquidAI/LFM2-1.2B |
| Merged Tokenizer | Msok99/lfm2-khmer-merged-18k |
| Khmer Source | Msok99/18k_tokenizer_v2 |
| Vocab Size | 81,127 tokens |
| Merge Type | Vocabulary-level merge |
| Language(s) | English + Khmer |
| Special Tokens | Preserved from the LFM2 base tokenizer |
## Merge Summary
| Metric | Value | Status |
|---|---|---|
| LFM2 Original Vocabulary | 64,400 | ✅ |
| Khmer Tokenizer Vocabulary | 18,001 | ✅ |
| Vocabulary Overlap | 1,274 | ✅ |
| Tokens Added | 16,727 | ✅ |
| Final Vocabulary Size | 81,127 | ✅ |
| Duplication Rate | 1.5% | Excellent |
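The table is internally consistent, as a quick check shows. The duplication-rate denominator is an assumption on my part; reading it as overlap over the combined pre-merge vocabularies matches the reported 1.5%:

```python
# Figures from the merge summary table above.
lfm2, khmer, overlap = 64_400, 18_001, 1_274

added = khmer - overlap            # 16,727 tokens actually added
final = lfm2 + added               # 81,127 final vocabulary size
dup   = overlap / (lfm2 + khmer)   # ~0.0155, i.e. the ~1.5% duplication rate
```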
## Performance Metrics
TPC is tokens per character (lower is better); the ratio compares the merged tokenizer against the standalone Khmer tokenizer.

| Test Type | Merged TPC | Khmer TPC | Ratio | Result |
|---|---|---|---|---|
| Pure Khmer (Compounds) | 0.111 | 0.111 | 1.00× | ✅ |
| Khmer Sentences | 0.143 | 0.143 | 1.00× | ✅ |
| Code-Switching (Simple) | 0.368 | 0.316 | 1.17× | ✅ |
| Code-Switching (Complex) | 0.220 | 0.240 | 0.92× | ✅ |
| Numbers / Mixed | 0.367 | 0.388 | 0.95× | ✅ |
| Average Efficiency Ratio | — | — | 1.01× | ✅ Excellent |
- ✅ 10/10 round-trip encoding tests passed
- ✅ 14/14 edge cases handled (mixed script, long words, punctuation, etc.)
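A minimal sketch of how these two measurements can be reproduced; the helper names are illustrative, not taken from the original test suite:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/lfm2-khmer-merged-18k")

def tokens_per_char(tok, text: str) -> float:
    # TPC: tokens needed per character of input (lower is better).
    return len(tok.encode(text, add_special_tokens=False)) / len(text)

def round_trips(tok, text: str) -> bool:
    # A text round-trips if encode -> decode returns it unchanged.
    return tok.decode(tok.encode(text, add_special_tokens=False)) == text

sample = "Hello world"
print(tokens_per_char(tokenizer, sample), round_trips(tokenizer, sample))
```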
## Key Highlights
- Perfect Khmer compound preservation (multi-syllable compounds encode as a single token)
- Seamless English–Khmer integration (code-switching friendly)
- Full compatibility with LFM2-1.2B models (see the embedding note below)
- Mathematical and numeric token coverage
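One practical note on model compatibility: if the merged tokenizer is paired with the base LFM2-1.2B weights for fine-tuning, the model's embedding matrix must be resized to the enlarged vocabulary. A minimal sketch; this step is an assumption about the intended workflow, not something documented in this repository:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/lfm2-khmer-merged-18k")
model = AutoModelForCausalLM.from_pretrained("LiquidAI/LFM2-1.2B")

# The merged vocabulary (81,127) exceeds the base model's (64,400),
# so the input/output embeddings need new rows before any training.
model.resize_token_embeddings(len(tokenizer))
```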
## How to Use
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/lfm2-khmer-merged-18k")

# Mixed English-Khmer text ("CEO of Apple and Google").
text = "CEO របស់ Apple និង Google"

print(tokenizer.tokenize(text))                    # token strings
print(tokenizer.decode(tokenizer.encode(text)))    # round-trip decode
```