B2NL-IntelligentTokenizer v6.2.1

Progressive Byte-to-Natural Language Tokenizer with Learned Semantic Boundaries

⚠️ EXPERIMENTAL VERSION - TRAINING IN PROGRESS

Current Status (Updated: 2025-10-XX)

πŸ”„ Active Training: This model is currently being trained on a single RTX 4070 (12GB)

  • Training progress: Day 1/9 (Epoch 11/100)
  • Expected completion: ~7-9 days

πŸ“Š Current Performance:

  • Compression ratio: 48:1 (aggressive experimental setting)
  • Reconstruction accuracy: 97%+ (teacher forcing mode)
  • Autoregressive training: Starting from epoch 31

⚠️ Known Limitations:

  • Autoregressive generation not yet trained (epochs 1-30: teacher forcing only)
  • Sliding window implementation has bugs (being fixed during training)
  • Coverage of all 204 languages is experimental; final performance TBD

🎯 Project Purpose:

This tokenizer is designed to enable more efficient LLM inference by separating language processing from reasoning:

  • Byte-level processing: No vocabulary needed, truly universal
  • Architecture separation:
    • Language Processing Model (this tokenizer): Handles linguistic complexity
    • Inference Model (LLM): Focuses purely on reasoning with compressed vectors
  • Goal: The inference model receives only semantic vectors, allowing it to concentrate entirely on reasoning without language-specific overhead

This approach aims to make LLMs more efficient by offloading all linguistic processing to a specialized encoder/decoder.
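
To make the intended interface concrete, below is a minimal conceptual sketch of how an inference model could consume the compressed vectors. It assumes the LLM accepts precomputed embeddings (an inputs_embeds-style argument) and that a learned linear projection bridges the tokenizer's 1280-dim vectors to the LLM's hidden size; CompressedLMPipeline, the projection layer, and the "embeddings" key are hypothetical and not part of this release.

import torch
import torch.nn as nn

class CompressedLMPipeline(nn.Module):
    # Hypothetical glue layer: B2NL semantic vectors in, LLM reasoning out.

    def __init__(self, tokenizer, llm, tok_dim=1280, llm_dim=4096):
        super().__init__()
        self.tokenizer = tokenizer                    # B2NL encoder/decoder (frozen)
        self.llm = llm                                # reasoning model
        self.project = nn.Linear(tok_dim, llm_dim)    # bridge 1280 -> LLM hidden size

    @torch.no_grad()
    def encode(self, text):
        # Compress raw text into a short sequence of semantic vectors.
        compressed = self.tokenizer.compress(text)    # assumed to expose "embeddings"
        return compressed["embeddings"]               # (num_vectors, 1280)

    def forward(self, text):
        vectors = self.encode(text)
        inputs_embeds = self.project(vectors).unsqueeze(0)   # (1, seq, llm_dim)
        # The LLM never sees bytes or subword tokens, only semantic vectors.
        return self.llm(inputs_embeds=inputs_embeds)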

🎯 Experiment Goal: Testing whether 48:1 compression is achievable across 204 languages with a single consumer GPU

Recommended: If you need a more complete model, please wait for training to finish

πŸš€ Key Innovation

Unlike traditional tokenizers (BPE, WordPiece) that split text using fixed rules, B2NL learns to identify semantic units within byte sequences through neural networks:

Traditional BPE:  "μ•ˆλ…•ν•˜μ„Έμš”" β†’ "μ•ˆ", "λ…•", "ν•˜", "μ„Έ", "μš”" (5 tokens, word fragments)
B2NL:            "μ•ˆλ…•ν•˜μ„Έμš”" β†’ [emb1, emb2, emb3] (3 embeddings, meaning preserved)
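
The byte-level arithmetic behind this example can be checked directly; the snippet below only counts UTF-8 bytes and does not depend on the model.

text = "μ•ˆλ…•ν•˜μ„Έμš”"
raw = text.encode("utf-8")
print(len(text))   # 5 characters
print(len(raw))    # 15 bytes (each Hangul syllable is 3 bytes in UTF-8)
# A pure byte-level tokenizer would spend 15 tokens on this string;
# B2NL maps the padded 48-byte window to 3 embeddings instead.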

πŸ“Š Architecture: 16:1 Fixed Compression

[48 bytes input] β†’ [Encoder] β†’ [3 Γ— 1280-dim embeddings] β†’ [Decoder] β†’ [48 bytes output]
         ↑                              ↓
    (with padding)             (semantic compression)

Design Philosophy

  • 48 bytes: Optimal for GPU parallelization and semantic unit capture
  • 3 embeddings: Balances compression efficiency with information preservation
  • Fixed ratio: Predictable costs for LLM APIs (always 16:1 compression)
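
The 16:1 ratio follows directly from the fixed 48-byte window and the 3 output embeddings. A minimal sketch of the padding arithmetic is shown below; the exact padding byte used by the model is an assumption here.

CHUNK_BYTES = 48        # fixed input window
NUM_EMBEDDINGS = 3      # fixed output length per window

def pad_to_window(text, pad_byte=0):
    # Pad a short text so it fills exactly one 48-byte window.
    raw = text.encode("utf-8")
    assert len(raw) <= CHUNK_BYTES, "longer texts use the sliding window below"
    return raw + bytes([pad_byte]) * (CHUNK_BYTES - len(raw))

window = pad_to_window("Hello, world!")
print(len(window))                       # 48
print(CHUNK_BYTES / NUM_EMBEDDINGS)      # 16.0 -> the fixed 16:1 ratio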

Sliding Window for Long Texts

For texts exceeding 48 bytes, the model employs a sliding window with an 8-byte overlap:

Chunk 1: [Bytes 0-48]   β†’ 3 embeddings
              ↓ (8-byte overlap)
Chunk 2: [Bytes 40-88]  β†’ 3 embeddings
              ↓ (8-byte overlap)
Chunk 3: [Bytes 80-128] β†’ 3 embeddings
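
A simple sketch of this chunking scheme (48-byte window, 40-byte stride, so consecutive chunks share 8 bytes) reproduces the boundaries in the diagram above:

WINDOW = 48                 # bytes per chunk
OVERLAP = 8                 # bytes shared between consecutive chunks
STRIDE = WINDOW - OVERLAP   # 40

def sliding_chunks(data):
    # Yield (start, end, chunk) triples covering the whole byte sequence.
    start = 0
    while start < len(data):
        end = min(start + WINDOW, len(data))
        yield start, end, data[start:end]
        if end == len(data):
            break
        start += STRIDE

text = "A long multilingual document that does not fit in one window. " * 3
for start, end, chunk in sliding_chunks(text.encode("utf-8")):
    print(f"Bytes {start}-{end}: {len(chunk)} bytes -> 3 embeddings")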

πŸ’‘ Real-World Applications

  • LLM Cost Reduction: 75% fewer tokens = significant cost savings
  • Multilingual Search: Unified embedding space for 204 languages
  • Edge Computing: Efficient compression for bandwidth-limited scenarios
  • Cross-modal AI: Universal byte-level representation
  • Document Processing: Consistent compression across all languages

πŸ› οΈ Technical Specifications

Model Architecture

  • Encoder: 6-layer Transformer (137.9M parameters)
    • Progressive dimension reduction
    • Learned semantic boundary detection
  • Decoder: 6-layer Transformer (106.8M parameters)
    • Multi-Query Attention (12 query heads, 2 KV heads)
    • Cross-attention with encoder hidden states
  • Embedding Dimension: 1280
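
As a reference for the attention layout listed above (12 query heads sharing 2 key/value heads at model dimension 1280), here is a minimal sketch of Multi-Query (grouped) attention. It illustrates the general technique only; the per-head dimension and projection sizes are assumptions, not the values used in this repository.

import math
import torch
import torch.nn as nn

class MultiQueryAttention(nn.Module):
    # Grouped attention: many query heads share a small set of key/value heads.

    def __init__(self, d_model=1280, n_q_heads=12, n_kv_heads=2, head_dim=64):
        super().__init__()
        # head_dim is an assumption; the model card does not state the per-head size.
        self.n_q_heads, self.n_kv_heads, self.head_dim = n_q_heads, n_kv_heads, head_dim
        self.q_proj = nn.Linear(d_model, n_q_heads * head_dim)
        self.k_proj = nn.Linear(d_model, n_kv_heads * head_dim)
        self.v_proj = nn.Linear(d_model, n_kv_heads * head_dim)
        self.o_proj = nn.Linear(n_q_heads * head_dim, d_model)

    def forward(self, x, memory=None):
        # With memory set, this is cross-attention over encoder hidden states;
        # without it, ordinary self-attention.
        memory = x if memory is None else memory
        B, Tq, _ = x.shape
        Tk = memory.shape[1]
        group = self.n_q_heads // self.n_kv_heads   # 6 query heads per KV head
        q = self.q_proj(x).view(B, Tq, self.n_q_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(memory).view(B, Tk, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(memory).view(B, Tk, self.n_kv_heads, self.head_dim).transpose(1, 2)
        k = k.repeat_interleave(group, dim=1)       # share each KV head across a group
        v = v.repeat_interleave(group, dim=1)
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, Tq, -1)
        return self.o_proj(out)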

Training Details

  • Dataset: FLORES-200 (204 languages, balanced multilingual corpus)
  • Training: Teacher forcing with gradient accumulation
  • Hardware: NVIDIA RTX 4070
  • Epochs: 100
  • Batch Size: 32
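
A minimal sketch of the training pattern named above (teacher forcing with gradient accumulation) follows. The accumulation factor, the model call signature, and the loss layout are assumptions, not the project's actual training script.

import torch
import torch.nn.functional as F

ACCUM_STEPS = 8   # assumed; effective batch size = 32 * ACCUM_STEPS

def train_epoch(model, loader, optimizer, device="cuda"):
    model.train()
    optimizer.zero_grad()
    for step, (input_bytes, target_bytes) in enumerate(loader):
        input_bytes, target_bytes = input_bytes.to(device), target_bytes.to(device)
        # Teacher forcing: the decoder is conditioned on the ground-truth bytes,
        # not on its own previous predictions.
        logits = model(input_bytes, decoder_input=target_bytes[:, :-1])
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            target_bytes[:, 1:].reshape(-1),
        )
        (loss / ACCUM_STEPS).backward()        # accumulate gradients over micro-batches
        if (step + 1) % ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()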

πŸ’» Usage

import torch
from unified_model import IntelligentTokenizerV62

# Load model
model = IntelligentTokenizerV62()
checkpoint = torch.load('pytorch_model.bin', map_location='cpu')
model.load_state_dict(checkpoint)
model.eval()

# Compress text to embeddings
text = "Hello, world!"
compressed = model.compress(text)
print(f"Embeddings: {compressed['num_tokens']}")  # 3
print(f"Compression: {compressed['compression_ratio']}")  # 16.0

# Reconstruct the text (round trip: compress, then decode)
reconstructed = model.generate(text, temperature=0.1)
print(f"Original: {text}")
print(f"Reconstructed: {reconstructed}")

πŸ“ˆ Roadmap

  • v6.3: Autoregressive training for improved generation quality
  • v6.4: Non-autoregressive mode for 10x speedup
  • v7.0: Dynamic compression ratios (8:1 to 32:1)

πŸ“ Citation

@software{b2nl_tokenizer_2025,
  title = {B2NL-IntelligentTokenizer: Progressive Byte-to-Natural Language Tokenization with Learned Semantic Boundaries},
  author = {Jinhyun Woo},
  year = {2025},
  month = {10},
  version = {6.2.1},
  url = {https://huggingface.co/ggunio/B2NL-IntelligentTokenizer-v6.2.1}
}

License

Apache 2.0
