# LFM2-Khmer-Merged-18K Tokenizer

**Merged Tokenizer for English + Khmer**
This repository contains a merged tokenizer that integrates:

- LiquidAI/LFM2-1.2B (English tokenizer)
- Msok99/18k_tokenizer_v2 (Khmer tokenizer)

The result is a unified tokenizer that handles English, Khmer, numbers, math, and mixed-script text with near-parity tokenization efficiency and full round-trip accuracy.
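A minimal sketch of how a vocabulary-level merge like this can be approximated with the `transformers` API. Note this adds the missing Khmer tokens as *added tokens* rather than rebuilding the underlying BPE merges, so it illustrates the process rather than reproducing the published artifact exactly:

```python
from transformers import AutoTokenizer

# Load the English base and the Khmer tokenizer.
base = AutoTokenizer.from_pretrained("LiquidAI/LFM2-1.2B")
khmer = AutoTokenizer.from_pretrained("Msok99/18k_tokenizer_v2")

# Keep only the Khmer tokens that are not already in the base vocabulary.
base_vocab = set(base.get_vocab())
new_tokens = [tok for tok in khmer.get_vocab() if tok not in base_vocab]

base.add_tokens(new_tokens)   # extends the vocabulary in place
print(len(base))              # final vocabulary size
base.save_pretrained("lfm2-khmer-merged-18k")
```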
## Model Details
| Attribute | Description |
|---|---|
| Base Model | LiquidAI/LFM2-1.2B |
| Merged Tokenizer | Msok99/lfm2-khmer-merged-18k |
| Khmer Source | Msok99/18k_tokenizer_v2 |
| Vocab Size | 81,127 tokens |
| Merge Type | Vocabulary-level merge |
| Language(s) | English + Khmer |
| Special Tokens | Preserved from the LFM2 base tokenizer |
## Merge Summary
| Metric | Value | Status |
|---|---|---|
| LFM2 Original Vocabulary | 64,400 | ✅ |
| Khmer Tokenizer Vocabulary | 18,001 | ✅ |
| Vocabulary Overlap | 1,274 | ✅ |
| Tokens Added | 16,727 | ✅ |
| Final Vocabulary Size | 81,127 | ✅ |
| Duplication Rate | 1.5% | Excellent |
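The table is internally consistent, as a quick check shows. The duplication-rate denominator is an assumption on my part; reading it as overlap over the combined pre-merge vocabularies matches the reported 1.5%:

```python
# Figures from the merge summary table above.
lfm2, khmer, overlap = 64_400, 18_001, 1_274

added = khmer - overlap            # 16,727 tokens actually added
final = lfm2 + added               # 81,127 final vocabulary size
dup   = overlap / (lfm2 + khmer)   # ~0.0155, i.e. the ~1.5% duplication rate
```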
## Performance Metrics
TPC is tokens per character (lower is better); the ratio compares the merged tokenizer against the standalone Khmer tokenizer.

| Test Type | Merged TPC | Khmer TPC | Ratio | Result |
|---|---|---|---|---|
| Pure Khmer (Compounds) | 0.111 | 0.111 | 1.00× | ✅ |
| Khmer Sentences | 0.143 | 0.143 | 1.00× | ✅ |
| Code-Switching (Simple) | 0.368 | 0.316 | 1.17× | ✅ |
| Code-Switching (Complex) | 0.220 | 0.240 | 0.92× | ✅ |
| Numbers / Mixed | 0.367 | 0.388 | 0.95× | ✅ |
| Average Efficiency Ratio | — | — | 1.01× | ✅ Excellent |
- ✅ 10/10 round-trip encoding tests passed
- ✅ 14/14 edge cases handled (mixed script, long words, punctuation, etc.)
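A minimal sketch of how these two measurements can be reproduced; the helper names are illustrative, not taken from the original test suite:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/lfm2-khmer-merged-18k")

def tokens_per_char(tok, text: str) -> float:
    # TPC: tokens needed per character of input (lower is better).
    return len(tok.encode(text, add_special_tokens=False)) / len(text)

def round_trips(tok, text: str) -> bool:
    # A text round-trips if encode -> decode returns it unchanged.
    return tok.decode(tok.encode(text, add_special_tokens=False)) == text

sample = "Hello world"
print(tokens_per_char(tokenizer, sample), round_trips(tokenizer, sample))
```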
## Key Highlights
- Perfect Khmer compound preservation (multi-syllable compounds encode as a single token)
- Seamless English–Khmer integration (code-switching friendly)
- Full compatibility with LFM2-1.2B models (see the embedding note below)
- Mathematical and numeric token coverage
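One practical note on model compatibility: if the merged tokenizer is paired with the base LFM2-1.2B weights for fine-tuning, the model's embedding matrix must be resized to the enlarged vocabulary. A minimal sketch; this step is an assumption about the intended workflow, not something documented in this repository:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/lfm2-khmer-merged-18k")
model = AutoModelForCausalLM.from_pretrained("LiquidAI/LFM2-1.2B")

# The merged vocabulary (81,127) exceeds the base model's (64,400),
# so the input/output embeddings need new rows before any training.
model.resize_token_embeddings(len(tokenizer))
```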
## How to Use
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/lfm2-khmer-merged-18k")

# Mixed English-Khmer text ("CEO of Apple and Google").
text = "CEO របស់ Apple និង Google"

print(tokenizer.tokenize(text))                    # token strings
print(tokenizer.decode(tokenizer.encode(text)))    # round-trip decode
```