Potentially wrong tokenization of conversations

#10 opened by kimihailv

Hello. This is your example from the README:

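# tokenizer and model are assumed to have been loaded earlier, as in the README
# (e.g. via AutoTokenizer.from_pretrained / AutoModelForCausalLM.from_pretrained).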
prompt = "Give me a brief explanation of gravity in simple terms."
messages_think = [
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages_think,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

tokenizer.apply_chat_template adds a BOS token at the beginning of the string, and the subsequent tokenizer([text], return_tensors="pt") call adds a second BOS token. I suppose this is not the desired behaviour.
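For reference, the duplication can be checked by counting the leading BOS tokens in the encoded input. A minimal sketch, assuming the tokenizer defines bos_token_id and model_inputs comes from the snippet above:

# Count consecutive BOS tokens at the start of the encoded prompt.
ids = model_inputs["input_ids"][0].tolist()
num_bos = 0
for tok in ids:
    if tok != tokenizer.bos_token_id:
        break
    num_bos += 1
print(f"leading BOS tokens: {num_bos}")  # prints 2 with the snippet above; 1 is expected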

Hi, thanks for your comment. You are right, this is not the desired behavior, and we will update the README accordingly. You should disable the addition of special tokens when tokenizing the already-templated text: model_inputs = tokenizer([text], return_tensors="pt", add_special_tokens=False).
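For completeness, here is a minimal sketch of the corrected call, together with an alternative that, on recent transformers versions, lets apply_chat_template tokenize directly so the template supplies the single BOS token itself. Both assume tokenizer, model, text, and messages_think from the snippet above:

# Option 1: tokenize the already-templated string without adding special tokens again.
model_inputs = tokenizer([text], return_tensors="pt", add_special_tokens=False).to(model.device)

# Option 2: let apply_chat_template tokenize directly (requires a recent transformers version);
# the chat template itself inserts the BOS token, so it appears only once.
model_inputs = tokenizer.apply_chat_template(
    messages_think,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)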

Hello, I updated the README. I will close this issue.

nathanrchn changed discussion status to closed
