---
tags:
- byte-level-tokenization
- compression
- multilingual
- korean
- english
- chinese
- japanese
- spanish
- arabic
license: apache-2.0
language:
- ko
- en
- zh
- ja
- es
- ar
---

# B2NL v6.1.2 - Byte-to-Natural Language Tokenizer

## Model Description

B2NL (Byte-to-Natural Language) v6.1.2 is a byte-level tokenizer that learns compression directly from raw bytes, without any predefined vocabulary.

## Key Features

- **18.6:1 average compression** across 6 languages
- **100% reconstruction accuracy** on the evaluation set
- **6 core languages**: Korean, English, Chinese, Japanese, Spanish, Arabic
- **Fixed 64-byte chunks** for processing
- **Boundary learning system** for intelligent grouping
- **No vocabulary needed**: pure byte-level processing

## Performance Metrics

| Language Type | Languages | Compression Ratio | Reconstruction |
|---------------|-----------|-------------------|----------------|
| Isolating | Chinese | 39.0:1 | 100% |
| Agglutinative | Korean, Japanese | 26.5:1 | 100% |
| Fusional | English, Spanish | 5.4:1 | 100% |
| Semitic | Arabic | 12.3:1 | 100% |
| **Average** | **6 languages** | **18.6:1** | **100%** |

## Model Architecture

- **Model size**: 301.7M parameters
- **Encoder**: 5-layer transformer with progressive dimensions [768, 896, 1024, 1152, 1280]
- **Decoder**: 8-layer transformer with a 1280-dimensional hidden size
- **Cross-attention**: 20 heads for relational learning
- **Training**: 233 epochs on 6 languages

## Dataset

- **Flores-200**: multilingual machine translation benchmark
- **6 languages** in the current release (Korean, English, Chinese, Japanese, Spanish, Arabic)
- Support for **204 languages** is planned

## Usage

```python
from huggingface_hub import hf_hub_download
import torch

# Download the checkpoint from the Hugging Face Hub
model_path = hf_hub_download(
    repo_id="ggunio/B2NL-v6.1.2",
    filename="pytorch_model.bin",
)

# Load the raw state dict; instantiating the model requires the
# custom model class (see the demo Space for the full implementation)
state_dict = torch.load(model_path, map_location="cpu")
```

## Demo

Try the live demo: [B2NL v6.1.2 Demo](https://huggingface.co/spaces/ggunio/intelligent-tokenizer-v6-demo)

## Training Data

The model was trained on text data from 6 languages:

- Korean (한국어)
- English
- Chinese (中文)
- Japanese (日本語)
- Spanish (Español)
- Arabic (العربية)

## Limitations

- The current version is trained on 6 languages only
- Compression rates may vary with additional languages
- Inference requires a custom model implementation

## Citation

```bibtex
@software{b2nl2025,
  title   = {B2NL: Byte-to-Natural-Language Tokenizer v6.1.2},
  author  = {Jinhyun Woo},
  year    = {2025},
  version = {6.1.2},
  note    = {18.6:1 compression, 100% reconstruction for 6 languages}
}
```

## 📬 Links

- **GitHub**: [Repository](https://github.com/Woojiggun/intelligent-tokenizer)
- **Demo**: [Try it live](https://huggingface.co/spaces/ggunio/intelligent-tokenizer-v6-demo)
- **Paper**: [Read on Zenodo](https://zenodo.org/records/17116281?token=eyJhbGciOiJIUzUxMiJ9.eyJpZCI6ImIyNWZiYTQyLWNiNGEtNDBmNi1iNTczLWVkMDJlNDI1YTQ1OSIsImRhdGEiOnt9LCJyYW5kb20iOiI0OWJkZWMzMjJjZTc3OTIwMTk4NTJlNTY1YmNjOGU1ZiJ9.Z_hXEp160tWBD5Qe2laQv1vhS4Js2a0R5BMWYs2PTG5vJMrc8l-BmPAIMya9O_HiN85jYZp-WOMOHg_DTHrg2A) | [PDF](Intelligent%20Tokenizer.pdf)

---

## License

Apache 2.0
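
## Appendix: 64-Byte Chunking Sketch

The fixed 64-byte chunking mentioned under Key Features can be illustrated with a minimal sketch. Note this is a hypothetical helper for intuition only: the function name and the cut-at-exact-offset policy are assumptions, and B2NL's real preprocessing may instead align chunks with learned boundaries.

```python
def chunk_bytes(text: str, chunk_size: int = 64) -> list:
    """Split the UTF-8 bytes of `text` into fixed-size chunks.

    Hypothetical illustration: the actual B2NL pipeline may group
    bytes using its learned boundary system rather than fixed cuts.
    """
    data = text.encode("utf-8")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


# Byte counts differ across scripts: one Hangul syllable is 3 UTF-8
# bytes, so a 64-byte chunk holds roughly 21 Korean characters.
chunks = chunk_bytes("안녕하세요 " * 20)  # 20 * 16 bytes = 320 bytes total
print([len(c) for c in chunks])  # -> [64, 64, 64, 64, 64]
```

This also makes clear why compression ratios vary by script (see Performance Metrics): the same number of characters occupies very different byte counts in Chinese, Korean, and English.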