---
license: apache-2.0
language:
- en
---

# Nanochat Tokenizer

This is the tokenizer from [Andrej Karpathy's](https://huggingface.co/karpathy) educational project [nanochat](https://huggingface.co/nanochat-students). Training it is the first step of the [speedrun.sh](https://github.com/karpathy/nanochat/blob/master/speedrun.sh) script.

# Training

First, download the first ~2B characters of the pretraining dataset using the dataset script in nanochat.

```
export NANOCHAT_BASE_DIR=".cache/nanochat"
mkdir -p "$NANOCHAT_BASE_DIR"
python -m nanochat.dataset -n 8
```

Then train the tokenizer with a vocab size of 65,536 on ~2B characters of data:

```
python -m scripts.tok_train --max_chars=2000000000
```

And finally, evaluate:

```
python -m scripts.tok_eval
```

## Tokenizer training

timestamp: 2025-10-14 10:29:05

- max_chars: 2,000,000,000
- doc_cap: 10,000
- vocab_size: 65,536
- train_time: 52.9085
- num_special_tokens: 9
- token_bytes_min: 1
- token_bytes_max: 32
- token_bytes_mean: 6.9197
- token_bytes_std: 2.8748

## Tokenizer evaluation

timestamp: 2025-10-14 10:29:10

### Comparison with GPT-2

| Text Type | Bytes | GPT-2 Tokens | GPT-2 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
|-----------|-------|--------------|-------------|-------------|------------|-----------------|
| news | 1819 | 404 | 4.50 | 375 | 4.85 | +7.2% |
| korean | 893 | 745 | 1.20 | 712 | 1.25 | +4.4% |
| code | 1259 | 576 | 2.19 | 492 | 2.56 | +14.6% |
| math | 1834 | 936 | 1.96 | 966 | 1.90 | -3.2% |
| science | 1112 | 260 | 4.28 | 228 | 4.88 | +12.3% |
| fwe-train | 4208518 | 900364 | 4.67 | 856883 | 4.91 | +4.8% |
| fwe-val | 4991242 | 1075364 | 4.64 | 1027241 | 4.86 | +4.5% |

### Comparison with GPT-4

| Text Type | Bytes | GPT-4 Tokens | GPT-4 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
|-----------|-------|--------------|-------------|-------------|------------|-----------------|
| news | 1819 | 387 | 4.70 | 375 | 4.85 | +3.1% |
| korean | 893 | 364 | 2.45 | 712 | 1.25 | -95.6% |
| code | 1259 | 309 | 4.07 | 492 | 2.56 | -59.2% |
| math | 1834 | 832 | 2.20 | 966 | 1.90 | -16.1% |
| science | 1112 | 249 | 4.47 | 228 | 4.88 | +8.4% |
| fwe-train | 4208518 | 874799 | 4.81 | 856883 | 4.91 | +2.0% |
| fwe-val | 4991242 | 1048837 | 4.76 | 1027241 | 4.86 | +2.1% |
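
### Reading the comparison columns

The numbers in the tables above are consistent with two simple definitions: a Ratio is bytes per token (higher means better compression), and Relative Diff % is the share of tokens saved relative to the baseline tokenizer (positive means ours uses fewer tokens). The sketch below reproduces the `news` row of the GPT-2 table under those assumptions. The formulas are inferred from the reported values, not copied from `scripts.tok_eval`, and the helper names `bytes_per_token` and `relative_diff_pct` are hypothetical.

```python
def bytes_per_token(num_bytes: int, num_tokens: int) -> float:
    """Compression ratio: average number of UTF-8 bytes covered by one token."""
    return num_bytes / num_tokens


def relative_diff_pct(baseline_tokens: int, ours_tokens: int) -> float:
    """Percentage of tokens saved versus the baseline (assumed definition)."""
    return (baseline_tokens - ours_tokens) / baseline_tokens * 100.0


# Values taken from the `news` row of the GPT-2 comparison table.
news_bytes, gpt2_tokens, ours_tokens = 1819, 404, 375

print(f"GPT-2 ratio: {bytes_per_token(news_bytes, gpt2_tokens):.2f}")        # 4.50
print(f"Ours ratio:  {bytes_per_token(news_bytes, ours_tokens):.2f}")        # 4.85
print(f"Relative diff: {relative_diff_pct(gpt2_tokens, ours_tokens):+.1f}%")  # +7.2%
```

The same arithmetic explains the large negative entries in the GPT-4 table: on the `korean` sample our tokenizer needs 712 tokens against 364, so the relative difference comes out at about -95.6%.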