B2NL-IntelligentTokenizer v6.2.1
Progressive Byte-to-Natural Language Tokenizer with Learned Semantic Boundaries
⚠️ EXPERIMENTAL VERSION - TRAINING IN PROGRESS
Current Status (Updated: 2025-10-XX)
Active Training: This model is currently being trained on a single RTX 4070 (12 GB)
- Training progress: Day 1/9 (Epoch 11/100)
- Expected completion: ~7-9 days
Current Performance:
- Compression ratio: 48:1 (aggressive experimental setting)
- Reconstruction accuracy: 97%+ (teacher forcing mode)
- Autoregressive training: Starting from epoch 31
⚠️ Known Limitations:
- Autoregressive generation not yet trained (epochs 1-30: teacher forcing only)
- Sliding window implementation has bugs (being fixed during training)
- Coverage of 204 languages is experimental; final performance TBD
🎯 Project Purpose:
This tokenizer is designed to enable more efficient LLM inference by separating language processing from reasoning:
- Byte-level processing: No vocabulary needed, truly universal
- Architecture separation:
- Language Processing Model (this tokenizer): Handles linguistic complexity
- Inference Model (LLM): Focuses purely on reasoning with compressed vectors
- Goal: The inference model receives only semantic vectors, allowing it to concentrate entirely on reasoning without language-specific overhead
This approach aims to make LLMs more efficient by offloading all linguistic processing to a specialized encoder/decoder.
🎯 Experiment Goal: Testing whether 48:1 compression is achievable across 204 languages with a single consumer GPU
Recommended: if you need a more complete model, please wait until training finishes
Key Innovation
Unlike traditional tokenizers (BPE, WordPiece) that split text using fixed rules, B2NL learns to identify semantic units within byte sequences through neural networks:
Traditional BPE: "안녕하세요" → "안", "녕", "하", "세", "요" (5 tokens, word fragments)
B2NL: "안녕하세요" → [emb1, emb2, emb3] (3 embeddings, meaning preserved)
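To make the byte-level framing concrete, the snippet below (illustrative only, not part of the released code) counts the raw UTF-8 bytes the model actually sees for the greeting above:

```python
# Illustrative only: the raw UTF-8 bytes a byte-level tokenizer operates on.
text = "안녕하세요"
raw = text.encode("utf-8")
print(len(text), len(raw))  # 5 characters, 15 bytes (3 bytes per Hangul syllable)
# A BPE vocabulary without Korean-specific merges falls back to word fragments
# (5 tokens above); B2NL maps the same 15 bytes to 3 learned embeddings.
```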
Architecture: 16:1 Fixed Compression
[48 bytes input] → [Encoder] → [3 × 1280-dim embeddings] → [Decoder] → [48 bytes output]
(input padded to 48 bytes; the three embeddings carry the semantic compression)
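A back-of-the-envelope check of the fixed ratio shown in the diagram; the zero-byte padding helper below is an assumption for illustration, not the model's actual padding scheme:

```python
# Every window is padded to 48 bytes and mapped to exactly 3 embeddings,
# so the byte-to-embedding ratio is always 48 / 3 = 16.
WINDOW_BYTES = 48
NUM_EMBEDDINGS = 3

def pad_to_window(data: bytes, size: int = WINDOW_BYTES) -> bytes:
    # Assumed padding scheme (zero bytes), purely for illustration.
    return data[:size].ljust(size, b"\x00")

chunk = pad_to_window("Hello, world!".encode("utf-8"))
print(len(chunk), len(chunk) / NUM_EMBEDDINGS)  # 48 16.0
```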
Design Philosophy
- 48 bytes: Optimal for GPU parallelization and semantic unit capture
- 3 embeddings: Balances compression efficiency with information preservation
- Fixed ratio: Predictable costs for LLM APIs (always 16:1 compression)
Sliding Window for Long Texts
For texts exceeding 48 bytes, the model employs a sliding window with an 8-byte overlap:
Chunk 1: [Bytes 0-48] → 3 embeddings
↓ (8-byte overlap)
Chunk 2: [Bytes 40-88] → 3 embeddings
↓ (8-byte overlap)
Chunk 3: [Bytes 80-128] → 3 embeddings
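A minimal sketch of the chunking arithmetic above (48-byte windows, 8-byte overlap, i.e. a stride of 40 bytes); the function name is illustrative and not part of the released API:

```python
def chunk_bytes(data: bytes, window: int = 48, overlap: int = 8):
    # Slide a fixed window over the byte sequence; consecutive chunks share
    # `overlap` bytes, so the stride is window - overlap (= 40 here).
    stride = window - overlap
    return [(start, data[start:start + window])
            for start in range(0, max(len(data) - overlap, 1), stride)]

text = "A sentence long enough that it does not fit inside one 48-byte window."
for i, (start, chunk) in enumerate(chunk_bytes(text.encode("utf-8")), 1):
    print(f"Chunk {i}: bytes {start}-{start + len(chunk)} ({len(chunk)} bytes)")
```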
Real-World Applications
- LLM Cost Reduction: 75% fewer tokens means significant cost savings (see the arithmetic sketch after this list)
- Multilingual Search: Unified embedding space for 204 languages
- Edge Computing: Efficient compression for bandwidth-limited scenarios
- Cross-modal AI: Universal byte-level representation
- Document Processing: Consistent compression across all languages
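The cost-reduction figure can be sanity-checked with rough arithmetic; the ~4 bytes-per-token average for English subword tokenizers is a common rule of thumb used here as an assumption, not a measurement from this project:

```python
# Assumption: a conventional subword tokenizer averages ~4 bytes per token on
# English text, so a 48-byte window costs ~12 tokens versus 3 B2NL embeddings.
window_bytes = 48
subword_tokens = window_bytes / 4       # ~12 tokens
b2nl_embeddings = 3
print(f"{1 - b2nl_embeddings / subword_tokens:.0%} fewer units")  # 75% fewer units
```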
Technical Specifications
Model Architecture
- Encoder: 6-layer Transformer (137.9M parameters)
- Progressive dimension reduction
- Learned semantic boundary detection
- Decoder: 6-layer Transformer (106.8M parameters)
- Multi-Query Attention (12 query heads, 2 KV heads; see the sketch after this list)
- Cross-attention with encoder hidden states
- Embedding Dimension: 1280
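For readers unfamiliar with this attention layout, below is a minimal, self-contained sketch of grouped/multi-query attention with 12 query heads sharing 2 key/value heads; d_model = 1280 matches the spec above, while the per-head dimension (64) and all module names are assumptions, not the released implementation:

```python
import torch
import torch.nn.functional as F
from torch import nn

class GroupedQueryAttention(nn.Module):
    """Sketch: 12 query heads share 2 key/value heads (6 query heads per KV head)."""
    def __init__(self, d_model=1280, n_q_heads=12, n_kv_heads=2, head_dim=64):
        super().__init__()
        self.n_q, self.n_kv, self.hd = n_q_heads, n_kv_heads, head_dim
        self.q_proj = nn.Linear(d_model, n_q_heads * head_dim)
        self.k_proj = nn.Linear(d_model, n_kv_heads * head_dim)
        self.v_proj = nn.Linear(d_model, n_kv_heads * head_dim)
        self.o_proj = nn.Linear(n_q_heads * head_dim, d_model)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_q, self.hd).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_kv, self.hd).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_kv, self.hd).transpose(1, 2)
        # Broadcast each KV head to its group of query heads (12 / 2 = 6 per group).
        k = k.repeat_interleave(self.n_q // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_q // self.n_kv, dim=1)
        out = F.scaled_dot_product_attention(q, k, v)
        return self.o_proj(out.transpose(1, 2).reshape(B, T, -1))

print(GroupedQueryAttention()(torch.randn(2, 10, 1280)).shape)  # torch.Size([2, 10, 1280])
```

Sharing a small number of KV heads shrinks the key/value cache, which is the usual motivation for this layout in decoders.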
Training Details
- Dataset: FLORES-200 (204 languages, balanced multilingual corpus)
- Training: Teacher forcing with gradient accumulation (see the sketch after this list)
- Hardware: NVIDIA RTX 4070
- Epochs: 100
- Batch Size: 32
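A self-contained illustration of gradient accumulation as mentioned above; the micro-batch split (4 × 8 = effective batch 32) and the tiny model are assumptions for the sketch, not the project's actual trainer:

```python
import torch
from torch import nn

model = nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 4                                   # assumed: 4 micro-batches of 8 -> effective batch 32
optimizer.zero_grad()
for step in range(8):                             # dummy batches standing in for real data
    x = torch.randn(8, 16)
    loss = nn.functional.mse_loss(model(x), x) / accum_steps  # scale so gradients average
    loss.backward()                               # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()                          # one optimizer update per effective batch
        optimizer.zero_grad()
```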
💻 Usage
import torch
from unified_model import IntelligentTokenizerV62
# Load model
model = IntelligentTokenizerV62()
checkpoint = torch.load('pytorch_model.bin', map_location='cpu')
model.load_state_dict(checkpoint)
model.eval()
# Compress text to embeddings
text = "Hello, world!"
compressed = model.compress(text)
print(f"Embeddings: {compressed['num_tokens']}") # 3
print(f"Compression: {compressed['compression_ratio']}") # 16.0
# Reconstruct (round trip: compress, then decode)
reconstructed = model.generate(text, temperature=0.1)
print(f"Original: {text}")
print(f"Reconstructed: {reconstructed}")
Roadmap
- v6.3: Autoregressive training for improved generation quality
- v6.4: Non-autoregressive mode for 10x speedup
- v7.0: Dynamic compression ratios (8:1 to 32:1)
Citation
@software{b2nl_tokenizer_2025,
title = {B2NL-IntelligentTokenizer: Progressive Byte-to-Natural Language Tokenization with Learned Semantic Boundaries},
author = {Jinhyun Woo},
year = {2025},
month = {10},
version = {6.2.1},
url = {https://huggingface.co/ggunio/B2NL-IntelligentTokenizer-v6.2.1}
}
Links
- Author: Jinhyun Woo
- GitHub: Woojiggun/intelligent-tokenizer
- Paper: Zenodo Publication
- Demo: Live Demo Space
License
Apache 2.0
Evaluation results
- Fixed Compression Ratio (self-reported): 16.000
- Korean Reconstruction, single chunk (self-reported): 95.000
- English Reconstruction (self-reported): 92.300