lewtun (HF Staff) committed
Commit 5f9cf2a · 1 parent: 78814a9

Disable normalization for special tokens

This PR fixes an issue in the Mistral tokenizer where special tokens are not tokenized correctly when they are concatenated with other characters. For example:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Gives correct IDs: {'input_ids': [2], 'attention_mask': [1]}
tokenizer("</s>", add_special_tokens=False)

# Gives correct IDs: {'input_ids': [842], 'attention_mask': [1]}
tokenizer(".", add_special_tokens=False)

# Gives incorrect IDs ("</s>" is split into pieces instead of mapping to ID 2): {'input_ids': [842, 700, 28713, 28767], 'attention_mask': [1, 1, 1, 1]}
tokenizer(".</s>", add_special_tokens=False)
```
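
One way to picture the failure is with a toy model. This is not the `tokenizers` library's code. It assumes that a special token flagged as normalized is passed through the same normalizer as the input before matching, and that the normalizer prepends a `▁` marker and replaces spaces with it. The helper names `toy_normalize` and `special_token_found` are hypothetical.

```python
# Toy model only: an assumed mechanism, not the tokenizers implementation.
SPACE_MARK = "\u2581"  # the "▁" marker used by SentencePiece-style vocabularies


def toy_normalize(text: str) -> str:
    """Assumed normalizer: prepend the marker and replace spaces with it."""
    return SPACE_MARK + text.replace(" ", SPACE_MARK)


def special_token_found(text: str, token: str, normalized: bool) -> bool:
    """Return True if the toy matcher would find `token` inside `text`."""
    if normalized:
        # Both the token and the input are normalized before matching.
        return toy_normalize(token) in toy_normalize(text)
    # The token is matched against the raw input text.
    return token in text


print(special_token_found("</s>", "</s>", normalized=True))    # True
print(special_token_found(".</s>", "</s>", normalized=True))   # False
print(special_token_found(".</s>", "</s>", normalized=False))  # True
```

Under these assumptions the normalized token becomes `▁</s>`, which appears in `▁</s>` but not in `▁.</s>`, matching the behaviour shown above.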

The solution is to disable normalization for all the special tokens (`<unk>`, `<s>`, and `</s>`), which is what this PR does by setting `"normalized": false` for each of them in `tokenizer_config.json`. Until this PR is merged, the following workaround with the slow tokenizer can be adopted:

```python
from tokenizers import AddedToken
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    eos_token=AddedToken("</s>", normalized=False),
    from_slow=True,
)
```
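
For a local copy of the config, the same edit can be sketched in a few lines of Python. This is a minimal sketch: it assumes the token entries sit under an `added_tokens_decoder` key (the diff in this PR only shows the entries themselves), and the function name is hypothetical. Whether patching this file alone is sufficient depends on how the tokenizer is loaded, which is not verified here.

```python
import json


def disable_special_token_normalization(config: dict) -> int:
    """Set "normalized" to False on every special token entry; return how many changed."""
    changed = 0
    # Assumption: token entries live under "added_tokens_decoder".
    for entry in config.get("added_tokens_decoder", {}).values():
        if entry.get("special") and entry.get("normalized"):
            entry["normalized"] = False
            changed += 1
    return changed


def make_entry(content: str) -> dict:
    """Build a stand-in entry with the fields shown in this PR's diff."""
    return {
        "content": content,
        "lstrip": False,
        "normalized": True,
        "rstrip": False,
        "single_word": False,
        "special": True,
    }


# Minimal stand-in for the relevant part of tokenizer_config.json.
example = {
    "added_tokens_decoder": {
        "0": make_entry("<unk>"),
        "1": make_entry("<s>"),
        "2": make_entry("</s>"),
    }
}

changed = disable_special_token_normalization(example)
print(changed)
print(json.dumps(example["added_tokens_decoder"]["2"], indent=2))
```

To apply it to a real file, load the file with `json.load`, call the function, and write the result back with `json.dump`.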

Files changed (1)
  1. tokenizer_config.json +3 -3

tokenizer_config.json CHANGED
@@ -5,7 +5,7 @@
     "0": {
       "content": "<unk>",
       "lstrip": false,
-      "normalized": true,
+      "normalized": false,
       "rstrip": false,
       "single_word": false,
       "special": true
@@ -13,7 +13,7 @@
     "1": {
       "content": "<s>",
       "lstrip": false,
-      "normalized": true,
+      "normalized": false,
       "rstrip": false,
       "single_word": false,
       "special": true
@@ -21,7 +21,7 @@
     "2": {
       "content": "</s>",
       "lstrip": false,
-      "normalized": true,
+      "normalized": false,
       "rstrip": false,
       "single_word": false,
       "special": true