Potentially wrong tokenization of conversations

#10 opened by kimihailv

Hello. This is your example from the README:

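# tokenizer and model are assumed to have been loaded earlier, as in the README
# (e.g. via AutoTokenizer.from_pretrained / AutoModelForCausalLM.from_pretrained).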
prompt = "Give me a brief explanation of gravity in simple terms."
messages_think = [
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages_think,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

tokenizer.apply_chat_template adds a BOS token at the beginning of the string, and the subsequent tokenizer([text], return_tensors="pt") call adds a second BOS token. I suppose this is not the desired behaviour.
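For reference, the duplication can be checked by counting the leading BOS tokens in the encoded input. A minimal sketch, assuming the tokenizer defines bos_token_id and model_inputs comes from the snippet above:

# Count consecutive BOS tokens at the start of the encoded prompt.
ids = model_inputs["input_ids"][0].tolist()
num_bos = 0
for tok in ids:
    if tok != tokenizer.bos_token_id:
        break
    num_bos += 1
print(f"leading BOS tokens: {num_bos}")  # prints 2 with the snippet above; 1 is expected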

Hi, thanks for your comment. You are right, this is not the desired behavior, and we will update the README accordingly. You should disable the addition of special tokens when tokenizing the already-templated text: model_inputs = tokenizer([text], return_tensors="pt", add_special_tokens=False).
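For completeness, here is a minimal sketch of the corrected call, together with an alternative that, on recent transformers versions, lets apply_chat_template tokenize directly so the template supplies the single BOS token itself. Both assume tokenizer, model, text, and messages_think from the snippet above:

# Option 1: tokenize the already-templated string without adding special tokens again.
model_inputs = tokenizer([text], return_tensors="pt", add_special_tokens=False).to(model.device)

# Option 2: let apply_chat_template tokenize directly (requires a recent transformers version);
# the chat template itself inserts the BOS token, so it appears only once.
model_inputs = tokenizer.apply_chat_template(
    messages_think,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)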

Hello, I updated the README. I will close this issue.

nathanrchn changed discussion status to closed
