
BPE tokenizer with byte fallback: 32k vocab

BPE tokenizer for encoder/MLM-objective models, with byte fallback:

  • Trained on pints-ai/Expository-Prose-V1; intended primarily for English and code.
  • This tokenizer is cased: "HELLO WORLD" tokenizes differently from "hello world".
  • model_max_length is set to 1e9 so the tokenizer itself never silently truncates; set tokenizer.model_max_length to your model's maximum position embeddings when training, as in the sketch below.
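
A minimal loading sketch (the repo id is taken from this card; the 512 cap is an arbitrary example, substitute your model's max position embeddings):

from transformers import AutoTokenizer

# load the fast tokenizer from the Hub
tokenizer = AutoTokenizer.from_pretrained("pszemraj/bytebpe-tokenizer-32k-mlm")

# cap the length to the target model's context so truncation behaves as expected
tokenizer.model_max_length = 512  # example value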

Details

MLM Tokenizer Configuration (EN/Code)

Model:

  • Type: BPE with byte_fallback
  • Special tokens: [UNK]=0, [CLS]=1, [SEP]=2, [PAD]=3, [MASK]=4 (vocab size: 32,000)

Pre-tokenization:

  • ByteLevel (add_prefix_space=true, trim_offsets=true, use_regex=true)
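
The pre-tokenizer can be inspected through the tokenizers backend; a quick sketch, assuming the tokenizer loaded above (the exact offsets shown are illustrative):

pre = tokenizer.backend_tokenizer.pre_tokenizer
print(pre.pre_tokenize_str("hello world"))
# e.g. [('Ġhello', (0, 5)), ('Ġworld', (5, 11))] -- 'Ġ' marks a preceding space;
# add_prefix_space=true inserts one before the first word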

Normalization:

  • Remove null bytes (U+0000) and replacement characters (U+FFFD)
  • Remove control chars (except \t, \n, \r)
  • NFC normalization
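
The normalizer is likewise inspectable (a sketch, assuming the tokenizer loaded above):

norm = tokenizer.backend_tokenizer.normalizer
# "e" + combining acute (U+0301) composes to "é" under NFC; the null byte is stripped
print(norm.normalize_str("cafe\u0301 \x00ok"))  # -> "café ok"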

Post-processing:

  • Single: [CLS] text [SEP]
  • Pair: [CLS] text_a [SEP] text_b [SEP] (type_ids: 0, 1)
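
For example, a sketch of what the two templates produce (fast tokenizers return token_type_ids by default):

single = tokenizer("hello world")  # input_ids wrap the text as [CLS] ... [SEP]

pair = tokenizer("first segment", "second segment")
print(pair["token_type_ids"])  # 0 over [CLS] text_a [SEP], then 1 over text_b [SEP]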

Decoder:

  • ByteLevel (add_prefix_space=true, trim_offsets=true, use_regex=true)

Key Features:

  • byte_fallback=true (no [UNK] on unknown characters; see the round-trip sketch below)
  • No dropout, no continuing_subword_prefix/end_of_word_suffix
  • BERT-style sequence formatting with GPT-2 style byte-level encoding
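
A quick check of the no-[UNK] behavior (the exact byte pieces printed will vary with the vocab):

text = "snowman: ☃"
ids = tokenizer.encode(text, add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(ids))  # byte-level pieces, no '[UNK]'
print(tokenizer.decode(ids))  # round-trips the text (possibly with a leading space from add_prefix_space)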

After loading with AutoTokenizer and calling print(tokenizer), the output is:

PreTrainedTokenizerFast(
    name_or_path="repo-name",
    vocab_size=32000,
    model_max_length=1000000000.0,
    is_fast=True,
    padding_side='right',
    truncation_side='right',
    special_tokens={
        'bos_token': '[CLS]',
        'eos_token': '[SEP]',
        'unk_token': '[UNK]',
        'sep_token': '[SEP]',
        'pad_token': '[PAD]',
        'cls_token': '[CLS]',
        'mask_token': '[MASK]',
    },
    clean_up_tokenization_spaces=False,
    added_tokens_decoder={
        0: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        1: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        2: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        3: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        4: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    },
)
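
Since the vocabulary includes [MASK], the tokenizer plugs directly into masked-language-model training; a minimal sketch with the standard transformers collator (the 0.15 probability is just the conventional default):

from transformers import DataCollatorForLanguageModeling

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
batch = collator([tokenizer("some expository prose", truncation=True)])
# batch["labels"] is -100 everywhere except at the masked positions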