---
license: mit
language:
- tur  # ISO 639-3 code or "und" if not identifiable
tags:
- tokenizer
- bpe
- flexitok
- fineweb2
---

# Byte-Level BPE Tokenizer: tur_Latn (16K)

A **Byte-Level BPE** tokenizer trained on **tur_Latn** data from Fineweb-2-HQ.

## Training Details

| Parameter | Value |
|-----------|-------|
| Algorithm | Byte-Level BPE |
| Language | `tur_Latn` |
| Target Vocab Size | 16,000 |
| Final Vocab Size | 16,000 |
| Pre-tokenizer | gpt4 |
| Number handling | individual |
| Contraction handling | True |
| Normalizer | NFC |
| Special Tokens | ``, ``, ``, `` |
| Training Shards | 2 |

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("flexitok/feat2-bpe_tur_Latn_16000")
tokens = tokenizer.encode("Hello, world!")
```

## Files

- `tokenizer.json`: Full HuggingFace tokenizer
- `vocab.json`: Vocabulary mapping
- `merges.txt`: BPE merge rules

## Sample Encoding

| Text | Tokens | Token IDs |
|------|--------|-----------|
| `Hello, world! 12345 This is a test. こんにちは` | `H, el, lo, ,, Ġw, orld, !, Ġ, 12, 3, 45, ĠT, his, Ġis, Ġa, Ġtest, ., Ġ, ãģ, ĵ` | `43, 239, 1228, 15, 1085, 7548, 4, 177, 1270, 22, 3874, 335, 3309, 1065, 255, 1641, 17, 177, 5477, 197` |
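
## Reading Byte-Level Tokens

The token strings in the sample encoding (`Ġw`, `ãģ`, `ĵ`) are not mis-encoded text. Byte-level BPE tokenizers usually display each UTF-8 byte as one printable character. The sketch below rebuilds the GPT-2 style byte-to-character table and uses it to read such tokens. It assumes this tokenizer uses that standard table, which the `Ġ` prefix for a leading space suggests but which this card does not state. The helper names (`bytes_to_unicode`, `to_byte_level`, `from_byte_level`) are illustrative and are not part of the published files.

```python
def bytes_to_unicode():
    """Build a GPT-2 style table mapping each of the 256 byte values to a printable character."""
    # Bytes that are already printable map to themselves.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in printable}
    # Remaining bytes (control characters, space, etc.) are shifted to code points from 256 upward.
    offset = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + offset)
            offset += 1
    return table


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_level(text):
    """Show text the way a byte-level vocabulary stores it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_level(tokens):
    """Join token strings and map them back to the original text."""
    data = bytes(CHAR_TO_BYTE[c] for c in "".join(tokens))
    return data.decode("utf-8", errors="replace")


print(to_byte_level(" world"))        # Ġworld
print(to_byte_level("こ"))             # ãģĵ
print(from_byte_level(["ãģ", "ĵ"]))   # こ
```

Under this assumption, the last two tokens in the sample row, `ãģ` and `ĵ`, together hold the three UTF-8 bytes of `こ`. This shows that Japanese text falls back to byte pieces in a vocabulary trained on Turkish.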