BPE tokenizer with byte fallback: 32k vocab
A BPE tokenizer for encoder models / MLM objectives, with byte fallback:
- Trained on pints-ai/Expository-Prose-V1; it is primarily intended for English and code.
- Cased: "HELLO WORLD" tokenizes differently from "hello world".
- `model_max_length` is set to 1e9 so that no hidden truncation occurs. Set `tokenizer.model_max_length` to your model's max position embeddings when training.
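The 1e9 sentinel effectively disables truncation until you override it. A minimal sketch of why the override matters (the `MAX_POS = 512` value is an assumed example, not a property of this tokenizer):

```python
# HF tokenizers truncate encoded sequences to model_max_length, so the
# shipped 1e9 sentinel effectively means "never truncate".
SENTINEL = int(1e9)  # value shipped with this tokenizer
MAX_POS = 512        # assumed model max position embeddings

def truncate(ids, model_max_length):
    """Mimic truncation_side='right' as reported by the tokenizer."""
    return ids[:model_max_length]

ids = list(range(600))                        # pretend-encoded sequence
assert len(truncate(ids, SENTINEL)) == 600    # sentinel: nothing cut
assert len(truncate(ids, MAX_POS)) == 512     # after the recommended override
```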
Details
MLM Tokenizer Configuration (EN/Code)
Model:
- Type: BPE with byte_fallback
- Vocab size: 32,000; special tokens: [UNK]=0, [CLS]=1, [SEP]=2, [PAD]=3, [MASK]=4
Pre-tokenization:
- ByteLevel (add_prefix_space=true, trim_offsets=true, use_regex=true)
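The ByteLevel pre-tokenizer maps every raw byte to a printable Unicode character via the GPT-2 byte-to-unicode table, which is why spaces appear as `Ġ` in token strings. A sketch of that table (reimplemented here for illustration, not taken from this repo):

```python
def bytes_to_unicode():
    # Printable byte values keep their own codepoint; the remaining
    # bytes are remapped to 256+n so every byte gets a visible character.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

table = bytes_to_unicode()
# add_prefix_space=true prepends a space, and space (0x20) maps to 'Ġ'.
assert table[ord(" ")] == "Ġ"
assert "".join(table[b] for b in " hi".encode("utf-8")) == "Ġhi"
```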
Normalization:
- Remove null bytes (U+0000, U+FFFD)
- Remove control chars (except \t, \n, \r)
- NFC normalization
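The normalization steps above can be sketched in plain Python with `unicodedata` (an illustrative reimplementation, not the tokenizer's actual normalizer):

```python
import unicodedata

def normalize(text: str) -> str:
    # 1) Drop null and replacement characters.
    text = text.replace("\u0000", "").replace("\ufffd", "")
    # 2) Drop control characters (Unicode category Cc) except \t, \n, \r.
    text = "".join(
        ch for ch in text
        if ch in "\t\n\r" or unicodedata.category(ch) != "Cc"
    )
    # 3) NFC composition.
    return unicodedata.normalize("NFC", text)

# 'e' + combining acute composes into the single codepoint 'é' under NFC.
assert normalize("cafe\u0301") == "café"
# Null byte and BEL (a control char) are stripped; newline survives.
assert normalize("a\u0000b\x07c\n") == "abc\n"
```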
Post-processing:
- Single: [CLS] text [SEP]
- Pair: [CLS] text_a [SEP] text_b [SEP] (type_ids: 0, 1)
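The single/pair templates can be simulated with the special-token ids from the table above (a sketch of the post-processor's behavior, not the library code):

```python
CLS, SEP = 1, 2  # special-token ids for this tokenizer

def build_inputs(ids_a, ids_b=None):
    """Mimic the post-processor: [CLS] A [SEP] (+ B [SEP]) with type ids."""
    ids = [CLS] + ids_a + [SEP]
    type_ids = [0] * len(ids)          # segment 0: [CLS] A [SEP]
    if ids_b is not None:
        ids += ids_b + [SEP]
        type_ids += [1] * (len(ids_b) + 1)  # segment 1: B [SEP]
    return ids, type_ids

ids, types = build_inputs([10, 11], [20])
assert ids == [1, 10, 11, 2, 20, 2]
assert types == [0, 0, 0, 0, 1, 1]
```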
Decoder:
- ByteLevel (add_prefix_space=true, trim_offsets=true, use_regex=true)
Key Features:
- byte_fallback=true (unknown characters decompose into byte tokens instead of producing [UNK])
- No dropout, no continuing_subword_prefix/end_of_word_suffix
- BERT-style sequence formatting with GPT-2 style byte-level encoding
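Byte fallback means a character missing from the vocabulary is emitted as one token per UTF-8 byte rather than [UNK]. A toy sketch of the idea (the `<0xNN>` token naming follows the usual byte-fallback convention; the tiny vocab is hypothetical):

```python
def byte_fallback(char: str, vocab: set) -> list:
    # If the character has no vocab entry, emit one <0xNN> token
    # per UTF-8 byte instead of falling back to [UNK].
    if char in vocab:
        return [char]
    return [f"<0x{b:02X}>" for b in char.encode("utf-8")]

vocab = {"h", "i"}
assert byte_fallback("h", vocab) == ["h"]
# U+00E9 'é' is two UTF-8 bytes: 0xC3 0xA9
assert byte_fallback("é", vocab) == ["<0xC3>", "<0xA9>"]
```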
After loading with `AutoTokenizer` and calling `print(tokenizer)`:
```
PreTrainedTokenizerFast(
    name_or_path="repo-name",
    vocab_size=32000,
    model_max_length=1000000000.0,
    is_fast=True,
    padding_side='right',
    truncation_side='right',
    special_tokens={
        'bos_token': '[CLS]',
        'eos_token': '[SEP]',
        'unk_token': '[UNK]',
        'sep_token': '[SEP]',
        'pad_token': '[PAD]',
        'cls_token': '[CLS]',
        'mask_token': '[MASK]',
    },
    clean_up_tokenization_spaces=False,
    added_tokens_decoder={
        0: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        1: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        2: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        3: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        4: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    },
)
```