🇰🇭 LFM2-Khmer-Merged-18K Tokenizer

🔄 Merged Tokenizer for English + Khmer

This repository contains the merged tokenizer that integrates:

  • LiquidAI/LFM2-1.2B (English tokenizer)
  • Msok99/18k_tokenizer_v2 (Khmer tokenizer)

The result is a unified tokenizer that handles English, Khmer, numbers, math, and mixed-script text. On average its token efficiency matches the Khmer-only tokenizer (1.01× ratio), and it passed all 10 round-trip encoding tests.


🧩 Model Details

| Attribute | Description |
| --- | --- |
| Base Model | LiquidAI/LFM2-1.2B |
| Merged Tokenizer | Msok99/lfm2-khmer-merged-18k |
| Khmer Source | Msok99/18k_tokenizer_v2 |
| Vocab Size | 81,127 tokens |
| Merge Type | Vocabulary-level merge |
| Language(s) | English + Khmer |
| Special Tokens | Preserved (`< |

📊 Merge Summary

| Metric | Value | Status |
| --- | --- | --- |
| LFM2 Original Vocabulary | 64,400 | ✅ |
| Khmer Tokenizer Vocabulary | 18,001 | ✅ |
| Vocabulary Overlap | 1,274 | ✅ |
| Tokens Added | 16,727 | ✅ |
| Final Vocabulary Size | 81,127 | ✅ |
| Duplication Rate | 1.5% | Excellent |
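
The figures above are consistent with a plain set union of the two vocabularies: tokens already present in LFM2 are kept once, and only the Khmer tokens missing from it are appended. The sketch below illustrates that bookkeeping on toy vocabularies. `merge_vocabs` and the sample tokens are hypothetical, and the actual merge script is not described in this card, so treat this as an illustration of the arithmetic rather than the author's procedure.

```python
def merge_vocabs(base_vocab, extra_vocab):
    """Append tokens from extra_vocab that are missing in base_vocab.

    Existing base IDs are left untouched; new tokens get the next free IDs.
    Returns (merged_vocab, overlap_count, added_count).
    """
    merged = dict(base_vocab)
    next_id = max(merged.values(), default=-1) + 1
    overlap = 0
    added = 0
    for token in extra_vocab:
        if token in merged:
            overlap += 1          # already covered by the base vocabulary
        else:
            merged[token] = next_id
            next_id += 1
            added += 1
    return merged, overlap, added


# Toy vocabularies (hypothetical tokens, not taken from either tokenizer).
base = {"<pad>": 0, "the": 1, "1": 2, "+": 3}
khmer = {"1": 0, "+": 1, "ក": 2, "ខ": 3, "គ": 4}

merged, overlap, added = merge_vocabs(base, khmer)
print(len(merged), overlap, added)  # 7 2 3

# The same bookkeeping applied to the sizes reported in the table.
base_size, khmer_size, reported_overlap = 64_400, 18_001, 1_274
tokens_added = khmer_size - reported_overlap
final_size = base_size + tokens_added
print(tokens_added, final_size)  # 16727 81127
```

The reported 1.5% duplication rate matches the overlap divided by the combined size of both source vocabularies (1,274 / 82,401), although the card does not state the formula.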

βš™οΈ Performance Metrics

Test Type Merged TPC Khmer TPC Ratio Result
Pure Khmer (Compounds) 0.111 0.111 1.00Γ— βœ…
Khmer Sentences 0.143 0.143 1.00Γ— βœ…
Code-Switching (Simple) 0.368 0.316 1.17Γ— βœ…
Code-Switching (Complex) 0.220 0.240 0.92Γ— βœ…
Numbers / Mixed 0.367 0.388 0.95Γ— βœ…
Average Efficiency Ratio β€” β€” 1.01Γ— βœ… Excellent
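
The card does not define TPC. The sketch below assumes it means tokens per character (token count divided by character count), with the ratio taken as merged TPC over Khmer-only TPC. `tokens_per_char` and the toy tokenizers are hypothetical stand-ins; any callable that maps a string to a list of tokens could be substituted.

```python
def tokens_per_char(tokenize, text):
    """Tokens produced per input character (lower means more compact)."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(tokenize(text)) / len(text)


# Toy tokenizers (assumption: stand-ins for the real merged / Khmer tokenizers).
def whitespace_tokenize(text):
    return text.split()


def char_tokenize(text):
    return list(text)


sample = "CEO of Apple"
print(tokens_per_char(whitespace_tokenize, sample))  # 3 tokens / 12 chars = 0.25
print(tokens_per_char(char_tokenize, sample))        # 12 tokens / 12 chars = 1.0

# The reported average is the arithmetic mean of the five per-test ratios.
reported_ratios = [1.00, 1.00, 1.17, 0.92, 0.95]
average_ratio = sum(reported_ratios) / len(reported_ratios)
print(round(average_ratio, 2))  # 1.01
```

Under this reading, a ratio below 1.00 means the merged tokenizer needs fewer tokens than the Khmer-only tokenizer on that test, and a ratio above 1.00 means it needs more.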

✅ 10/10 round-trip encoding tests passed
✅ 14/14 edge cases handled (mixed script, long words, punctuation, etc.)
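
A round-trip test checks that decoding the encoded IDs reproduces the input text exactly. The card does not list the test cases it used, so the following is only a minimal sketch of such a check. `round_trip_ok`, `run_round_trip_suite`, and the byte-level toy codec are hypothetical names introduced for illustration.

```python
def round_trip_ok(encode, decode, text):
    """True if decoding the encoded IDs reproduces the input exactly."""
    return decode(encode(text)) == text


def run_round_trip_suite(encode, decode, cases):
    """Return (passed_count, failures) for a list of test strings."""
    failures = [t for t in cases if not round_trip_ok(encode, decode, t)]
    return len(cases) - len(failures), failures


# Toy codec standing in for a real tokenizer: UTF-8 bytes act as token IDs.
def toy_encode(text):
    return list(text.encode("utf-8"))


def toy_decode(ids):
    return bytes(ids).decode("utf-8")


cases = ["hello", "CEO របស់ Apple និង Google", "2 + 2 = 4", ""]
passed, failures = run_round_trip_suite(toy_encode, toy_decode, cases)
print(f"{passed}/{len(cases)} round-trip cases passed")  # 4/4 round-trip cases passed
```

With the real tokenizer, one option is to pass `lambda t: tokenizer.encode(t, add_special_tokens=False)` and `lambda ids: tokenizer.decode(ids, skip_special_tokens=True)`, so that special tokens do not appear in the decoded text. Whether exact string equality holds can also depend on the tokenizer's normalization settings.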


🧠 Key Highlights

  • Perfect Khmer compound preservation (e.g. “អគ្គលេខាធិការ” → 1 token)
  • Seamless English–Khmer integration (code-switching friendly)
  • Full compatibility with LFM2-1.2B models
  • Mathematical and numeric token coverage

🧩 How to Use

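from transformers import AutoTokenizer

# Load the merged English + Khmer tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Msok99/lfm2-khmer-merged-18k")

# Mixed-script sample: "CEO of Apple and Google"
text = "CEO របស់ Apple និង Google"
print(tokenizer.tokenize(text))  # token strings
print(tokenizer.decode(tokenizer.encode(text)))  # round trip: text -> IDs -> text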