B2NL-IntelligentTokenizer v6.2.1

Progressive Byte-to-Natural Language Tokenizer with Learned Semantic Boundaries

⚠️ EXPERIMENTAL VERSION - TRAINING IN PROGRESS

Current Status (Updated: 2025-10-XX)

πŸ”„ Active Training: This model is currently being trained on a single RTX 4070 (12GB)

  • Training progress: Day 1/9 (Epoch 11/100)
  • Expected completion: ~7-9 days

πŸ“Š Current Performance:

  • Compression ratio: 48:1 (aggressive experimental setting)
  • Reconstruction accuracy: 97%+ (teacher forcing mode)
  • Autoregressive training: Starting from epoch 31

⚠️ Known Limitations:

  • Autoregressive generation not yet trained (epochs 1-30: teacher forcing only)
  • Sliding window implementation has bugs (being fixed during training)
  • Coverage of all 204 languages is experimental; final performance TBD

🎯 Project Purpose:

This tokenizer is designed to enable more efficient LLM inference by separating language processing from reasoning:

  • Byte-level processing: No vocabulary needed, truly universal
  • Architecture separation:
    • Language Processing Model (this tokenizer): Handles linguistic complexity
    • Inference Model (LLM): Focuses purely on reasoning with compressed vectors
  • Goal: The inference model receives only semantic vectors, allowing it to concentrate entirely on reasoning without language-specific overhead

This approach aims to make LLMs more efficient by offloading all linguistic processing to a specialized encoder/decoder.
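
To make the intended interface concrete, below is a minimal conceptual sketch of how an inference model could consume the compressed vectors. It assumes the LLM accepts precomputed embeddings (an inputs_embeds-style argument) and that a learned linear projection bridges the tokenizer's 1280-dim vectors to the LLM's hidden size; CompressedLMPipeline, the projection layer, and the "embeddings" key are hypothetical and not part of this release.

import torch
import torch.nn as nn

class CompressedLMPipeline(nn.Module):
    # Hypothetical glue layer: B2NL semantic vectors in, LLM reasoning out.

    def __init__(self, tokenizer, llm, tok_dim=1280, llm_dim=4096):
        super().__init__()
        self.tokenizer = tokenizer                    # B2NL encoder/decoder (frozen)
        self.llm = llm                                # reasoning model
        self.project = nn.Linear(tok_dim, llm_dim)    # bridge 1280 -> LLM hidden size

    @torch.no_grad()
    def encode(self, text):
        # Compress raw text into a short sequence of semantic vectors.
        compressed = self.tokenizer.compress(text)    # assumed to expose "embeddings"
        return compressed["embeddings"]               # (num_vectors, 1280)

    def forward(self, text):
        vectors = self.encode(text)
        inputs_embeds = self.project(vectors).unsqueeze(0)   # (1, seq, llm_dim)
        # The LLM never sees bytes or subword tokens, only semantic vectors.
        return self.llm(inputs_embeds=inputs_embeds)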

🎯 Experiment Goal: Testing whether 48:1 compression is achievable across 204 languages with a single consumer GPU

Recommended: If you need a more complete model, please wait for training to finish

πŸš€ Key Innovation

Unlike traditional tokenizers (BPE, WordPiece) that split text using fixed rules, B2NL learns to identify semantic units within byte sequences through neural networks:

Traditional BPE:  "μ•ˆλ…•ν•˜μ„Έμš”" β†’ "μ•ˆ", "λ…•", "ν•˜", "μ„Έ", "μš”" (5 tokens, word fragments)
B2NL:            "μ•ˆλ…•ν•˜μ„Έμš”" β†’ [emb1, emb2, emb3] (3 embeddings, meaning preserved)
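
The byte-level arithmetic behind this example can be checked directly; the snippet below only counts UTF-8 bytes and does not depend on the model.

text = "μ•ˆλ…•ν•˜μ„Έμš”"
raw = text.encode("utf-8")
print(len(text))   # 5 characters
print(len(raw))    # 15 bytes (each Hangul syllable is 3 bytes in UTF-8)
# A pure byte-level tokenizer would spend 15 tokens on this string;
# B2NL maps the padded 48-byte window to 3 embeddings instead.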

πŸ“Š Architecture: 16:1 Fixed Compression

[48 bytes input] β†’ [Encoder] β†’ [3 Γ— 1280-dim embeddings] β†’ [Decoder] β†’ [48 bytes output]
         ↑                              ↓
    (with padding)             (semantic compression)

Design Philosophy

  • 48 bytes: Optimal for GPU parallelization and semantic unit capture
  • 3 embeddings: Balances compression efficiency with information preservation
  • Fixed ratio: Predictable costs for LLM APIs (always 16:1 compression)
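
The 16:1 ratio follows directly from the fixed 48-byte window and the 3 output embeddings. A minimal sketch of the padding arithmetic is shown below; the exact padding byte used by the model is an assumption here.

CHUNK_BYTES = 48        # fixed input window
NUM_EMBEDDINGS = 3      # fixed output length per window

def pad_to_window(text, pad_byte=0):
    # Pad a short text so it fills exactly one 48-byte window.
    raw = text.encode("utf-8")
    assert len(raw) <= CHUNK_BYTES, "longer texts use the sliding window below"
    return raw + bytes([pad_byte]) * (CHUNK_BYTES - len(raw))

window = pad_to_window("Hello, world!")
print(len(window))                       # 48
print(CHUNK_BYTES / NUM_EMBEDDINGS)      # 16.0 -> the fixed 16:1 ratio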

Sliding Window for Long Texts

For texts exceeding 48 bytes, the model employs a sliding window with an 8-byte overlap:

Chunk 1: [Bytes 0-48]   β†’ 3 embeddings
              ↓ (8-byte overlap)
Chunk 2: [Bytes 40-88]  β†’ 3 embeddings
              ↓ (8-byte overlap)
Chunk 3: [Bytes 80-128] β†’ 3 embeddings
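
A simple sketch of this chunking scheme (48-byte window, 40-byte stride, so consecutive chunks share 8 bytes) reproduces the boundaries in the diagram above:

WINDOW = 48                 # bytes per chunk
OVERLAP = 8                 # bytes shared between consecutive chunks
STRIDE = WINDOW - OVERLAP   # 40

def sliding_chunks(data):
    # Yield (start, end, chunk) triples covering the whole byte sequence.
    start = 0
    while start < len(data):
        end = min(start + WINDOW, len(data))
        yield start, end, data[start:end]
        if end == len(data):
            break
        start += STRIDE

text = "A long multilingual document that does not fit in one window. " * 3
for start, end, chunk in sliding_chunks(text.encode("utf-8")):
    print(f"Bytes {start}-{end}: {len(chunk)} bytes -> 3 embeddings")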

πŸ’‘ Real-World Applications

  • LLM Cost Reduction: 75% fewer tokens = significant cost savings
  • Multilingual Search: Unified embedding space for 204 languages
  • Edge Computing: Efficient compression for bandwidth-limited scenarios
  • Cross-modal AI: Universal byte-level representation
  • Document Processing: Consistent compression across all languages

πŸ› οΈ Technical Specifications

Model Architecture

  • Encoder: 6-layer Transformer (137.9M parameters)
    • Progressive dimension reduction
    • Learned semantic boundary detection
  • Decoder: 6-layer Transformer (106.8M parameters)
    • Multi-Query Attention (12 query heads, 2 KV heads)
    • Cross-attention with encoder hidden states
  • Embedding Dimension: 1280
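
As a reference for the attention layout listed above (12 query heads sharing 2 key/value heads at model dimension 1280), here is a minimal sketch of Multi-Query (grouped) attention. It illustrates the general technique only; the per-head dimension and projection sizes are assumptions, not the values used in this repository.

import math
import torch
import torch.nn as nn

class MultiQueryAttention(nn.Module):
    # Grouped attention: many query heads share a small set of key/value heads.

    def __init__(self, d_model=1280, n_q_heads=12, n_kv_heads=2, head_dim=64):
        super().__init__()
        # head_dim is an assumption; the model card does not state the per-head size.
        self.n_q_heads, self.n_kv_heads, self.head_dim = n_q_heads, n_kv_heads, head_dim
        self.q_proj = nn.Linear(d_model, n_q_heads * head_dim)
        self.k_proj = nn.Linear(d_model, n_kv_heads * head_dim)
        self.v_proj = nn.Linear(d_model, n_kv_heads * head_dim)
        self.o_proj = nn.Linear(n_q_heads * head_dim, d_model)

    def forward(self, x, memory=None):
        # With memory set, this is cross-attention over encoder hidden states;
        # without it, ordinary self-attention.
        memory = x if memory is None else memory
        B, Tq, _ = x.shape
        Tk = memory.shape[1]
        group = self.n_q_heads // self.n_kv_heads   # 6 query heads per KV head
        q = self.q_proj(x).view(B, Tq, self.n_q_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(memory).view(B, Tk, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(memory).view(B, Tk, self.n_kv_heads, self.head_dim).transpose(1, 2)
        k = k.repeat_interleave(group, dim=1)       # share each KV head across a group
        v = v.repeat_interleave(group, dim=1)
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, Tq, -1)
        return self.o_proj(out)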

Training Details

  • Dataset: FLORES-200 (204 languages, balanced multilingual corpus)
  • Training: Teacher forcing with gradient accumulation
  • Hardware: NVIDIA RTX 4070
  • Epochs: 100
  • Batch Size: 32
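
A minimal sketch of the training pattern named above (teacher forcing with gradient accumulation) follows. The accumulation factor, the model call signature, and the loss layout are assumptions, not the project's actual training script.

import torch
import torch.nn.functional as F

ACCUM_STEPS = 8   # assumed; effective batch size = 32 * ACCUM_STEPS

def train_epoch(model, loader, optimizer, device="cuda"):
    model.train()
    optimizer.zero_grad()
    for step, (input_bytes, target_bytes) in enumerate(loader):
        input_bytes, target_bytes = input_bytes.to(device), target_bytes.to(device)
        # Teacher forcing: the decoder is conditioned on the ground-truth bytes,
        # not on its own previous predictions.
        logits = model(input_bytes, decoder_input=target_bytes[:, :-1])
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            target_bytes[:, 1:].reshape(-1),
        )
        (loss / ACCUM_STEPS).backward()        # accumulate gradients over micro-batches
        if (step + 1) % ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()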

πŸ’» Usage

import torch
from unified_model import IntelligentTokenizerV62

# Load model
model = IntelligentTokenizerV62()
checkpoint = torch.load('pytorch_model.bin', map_location='cpu')
model.load_state_dict(checkpoint)
model.eval()

# Compress text to embeddings
text = "Hello, world!"
compressed = model.compress(text)
print(f"Embeddings: {compressed['num_tokens']}")  # 3
print(f"Compression: {compressed['compression_ratio']}")  # 16.0

# Reconstruct the text (round trip: compress, then decode)
reconstructed = model.generate(text, temperature=0.1)
print(f"Original: {text}")
print(f"Reconstructed: {reconstructed}")

πŸ“ˆ Roadmap

  • v6.3: Autoregressive training for improved generation quality
  • v6.4: Non-autoregressive mode for 10x speedup
  • v7.0: Dynamic compression ratios (8:1 to 32:1)

πŸ“ Citation

@software{b2nl_tokenizer_2025,
  title = {B2NL-IntelligentTokenizer: Progressive Byte-to-Natural Language Tokenization with Learned Semantic Boundaries},
  author = {Jinhyun Woo},
  year = {2025},
  month = {10},
  version = {6.2.1},
  url = {https://huggingface.co/ggunio/B2NL-IntelligentTokenizer-v6.2.1}
}

License

Apache 2.0
