# B2NL v6.2.1 - Intelligent Tokenizer

**Byte-to-Natural Language Progressive Compression Tokenizer**

## Model Overview

**Release Date**: October 6, 2025
**Version**: 6.2.1
**Model Name**: B2NL-IntelligentTokenizer-v6.2.1
**Architecture**: Progressive Splitting Tokenizer with Multi-Query Attention

## Key Improvements from v6.1

### 🚀 Major Updates

1. **Extended Language Support**
   - **v6.1**: 6 languages (Korean, English, Chinese, Japanese, Spanish, Arabic)
   - **v6.2**: 204 languages with comprehensive multilingual training

2. **Model Efficiency**
   - **Parameter Reduction**: 245M → 137.9M (encoder) + 106.8M (decoder)
   - **Memory Optimization**: 8x smaller KV cache through Multi-Query Attention (MQA)
   - **Training Stability**: Enhanced gradient handling for batch sizes of 128+

3. **Compression Performance**
   - **Achievement**: 16:1 compression ratio (4x better than traditional BPE)
   - **Target Range**: 12:1 to 48:1 adaptive compression
   - **Token Efficiency**: 75% reduction in LLM API costs

## Technical Specifications

### Architecture Details

```yaml
Encoder:
  - Layers: 4
  - Hidden Dimension: 1280
  - Attention Heads: 16 (Query) / 2 (KV) - MQA
  - Compression: 48 bytes → 1-4 tokens

Decoder:
  - Layers: 6
  - Hidden Dimension: 1280
  - Cross-Attention: Multi-level (4 encoder layers)
  - Reconstruction: Byte-perfect recovery target
```

### Training Configuration

- **Dataset**: FLORES-200 by Meta AI (204 languages from all linguistic families)
- **Pre-training**: 10 epochs of byte relationship learning
- **Main Training**: 100 epochs with adaptive curriculum
- **Batch Size**: 128 (with gradient accumulation)
- **Optimizer**: AdamW with cosine annealing

## Performance Metrics

### Compression Ratio

| Metric | Value | Note |
|--------|-------|------|
| Average | 16:1 | 3 tokens per 48 bytes |
| Best Case | 48:1 | 1 token per 48 bytes |
| Worst Case | 12:1 | 4 tokens per 48 bytes |

### Reconstruction Quality

| Input Type | Accuracy | Scope |
|------------|----------|-------|
| Single Chunk | 90%+ | < 46 bytes |
| Multi-Chunk | 85%+ | > 46 bytes |
| Average | 88% | All languages |

### Language Coverage

- **Isolating**: Chinese, Vietnamese, Thai (95%+ accuracy)
- **Agglutinative**: Korean, Turkish, Finnish (92%+ accuracy)
- **Fusional**: Spanish, Russian, Arabic (90%+ accuracy)
- **Polysynthetic**: Inuktitut, Mohawk (85%+ accuracy)

## Key Features

### 1. Progressive Splitting

- Dynamic token allocation (1-4 tokens per 48-byte chunk)
- Gumbel-Softmax for differentiable discrete selection (see the first sketch below)
- Semantic boundary learning without hardcoding

### 2. Multi-Query Attention (MQA)

- 16 query heads → 2 KV heads
- 8x memory reduction in the KV cache (see the second sketch below)
- Quality maintained with fewer parameters

### 3. Adaptive Compression

- Content-aware compression decisions
- Preserves semantic boundaries
- Language-agnostic byte-level learning
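To make the Progressive Splitting idea concrete, here is a minimal sketch of how a Gumbel-Softmax head could pick between 1 and 4 tokens for a chunk while remaining differentiable. This assumes PyTorch; `TokenCountSelector` and all shapes are hypothetical illustrations, not the shipped `unified_model.py` code.

```python
# Hypothetical sketch: differentiable choice of token count per chunk.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenCountSelector(nn.Module):
    """Scores the options {1, 2, 3, 4} tokens per chunk and samples a
    one-hot choice that still passes gradients (straight-through)."""

    def __init__(self, d_model=1280, n_options=4):
        super().__init__()
        self.scorer = nn.Linear(d_model, n_options)

    def forward(self, chunk_repr, tau=1.0):
        logits = self.scorer(chunk_repr)              # (B, 4)
        # hard=True returns a one-hot sample in the forward pass while
        # the backward pass uses the soft relaxation (straight-through).
        choice = F.gumbel_softmax(logits, tau=tau, hard=True)
        n_tokens = choice.argmax(dim=-1) + 1          # 1..4 tokens per chunk
        return choice, n_tokens

selector = TokenCountSelector()
chunks = torch.randn(8, 1280)        # 8 pooled 48-byte chunk embeddings
one_hot, counts = selector(chunks)
print(counts)                        # e.g. tensor([3, 1, 4, 2, ...])
```

Annealing `tau` toward 0 over training makes the soft relaxation converge to the discrete choice, which is the usual way such a selector is trained without hardcoded boundaries.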
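Likewise, a minimal sketch of the MQA pattern described above: 16 query heads attend against 2 shared KV heads, so the cached keys and values are 2/16 = 1/8 of the full multi-head size, which is where the quoted 8x reduction comes from. Again a sketch assuming PyTorch, not the actual encoder implementation.

```python
# Hypothetical sketch: grouped/multi-query attention with 16 Q heads
# sharing 2 KV heads (hidden dim 1280, head dim 1280/16 = 80).
import torch
import torch.nn as nn

class MultiQueryAttention(nn.Module):
    def __init__(self, d_model=1280, n_q_heads=16, n_kv_heads=2):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.d_head = d_model // n_q_heads
        self.q_proj = nn.Linear(d_model, n_q_heads * self.d_head)
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.d_head)  # 8x smaller cache
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.d_head)
        self.o_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_q, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_kv, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_kv, self.d_head).transpose(1, 2)
        # Broadcast each KV head across its group of query heads.
        group = self.n_q // self.n_kv                 # 16 / 2 = 8 queries per KV head
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head**0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.o_proj(out)

mqa = MultiQueryAttention()
x = torch.randn(2, 48, 1280)         # e.g. one 48-byte chunk per batch element
print(mqa(x).shape)                  # torch.Size([2, 48, 1280])
```

Only `k_proj`/`v_proj` outputs need to be cached at decode time, so cache size scales with the 2 KV heads rather than the 16 query heads.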
## Use Cases

### Ideal For:

- **LLM Cost Reduction**: 75% token savings
- **Multilingual Applications**: 204 language support
- **Edge Deployment**: Reduced memory footprint
- **Real-time Processing**: Fixed 48-byte chunks

### Applications:

- Chat applications with token limits
- Document compression for LLMs
- Multilingual search systems
- Cross-lingual information retrieval

## Limitations

1. **Multi-chunk Processing**: Reconstruction quality decreases for very long texts
2. **Rare Languages**: Lower accuracy for extremely low-resource languages
3. **Domain Specificity**: Optimized for general text, not specialized domains

## Model Files

- `epoch_100.pt`: Final checkpoint (best performance)
- `config.yaml`: Training configuration
- `tokenizer.py`: Tokenizer implementation
- `unified_model.py`: Complete model architecture

## Citation

If you use this model in your research, please cite:

```bibtex
@software{b2nl_intelligent_tokenizer_2025,
  title   = {B2NL-IntelligentTokenizer v6.2: Progressive Compression with 204 Language Support},
  author  = {ggun1o},
  year    = {2025},
  month   = {10},
  version = {6.2.1},
  url     = {https://huggingface.co/ggun1o/B2NL-IntelligentTokenizer-v6.2}
}
```

## Links

- **Paper**: [Coming Soon]
- **GitHub**: [Repository Link]
- **LinkedIn**: [Author Profile]
- **Demo Space**: https://huggingface.co/spaces/ggun1o/B2NL-v6.1.2

## License

Apache 2.0

## Acknowledgments

This work builds upon the B2NL (Byte-to-Natural Language) tokenization approach, extending it with progressive compression and comprehensive multilingual support.

---

**Contact**: For questions and collaborations, please reach out through GitHub issues or LinkedIn.