Potentially wrong tokenization of conversations
#10
opened by kimihailv
Hello. This is the example from your README:
prompt = "Give me a brief explanation of gravity in simple terms."
messages_think = [
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages_think,
tokenize=False,
add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
`tokenizer.apply_chat_template` adds a BOS token at the beginning of the string, and tokenizing the result again with `tokenizer([text], return_tensors="pt")` prepends a second BOS token. I suppose this is not the desired behaviour.
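A quick way to check this, as a minimal sketch reusing `tokenizer`, `text`, and `model_inputs` from the snippet above (and assuming the tokenizer defines a BOS token, as here):

```python
# The templated string already starts with the BOS token before
# any further tokenization.
print(text.startswith(tokenizer.bos_token))  # True

# Tokenizing it again prepends a second BOS id by default.
ids = model_inputs.input_ids[0]
num_bos = (ids == tokenizer.bos_token_id).sum().item()
print(num_bos)  # 2, where only 1 is expected
```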
Hi, thanks for your comment. You are right, this is not the desired behavior, and we will update the README accordingly. You should disable the special tokens when tokenizing the text: `model_inputs = tokenizer([text], return_tensors="pt", add_special_tokens=False)`.
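For reference, a minimal sketch of the corrected call (reusing `text`, `messages_think`, and `model` from above); as an alternative not mentioned in the thread, `apply_chat_template` can also tokenize in one step, so BOS is only ever added once:

```python
# Fixed: the chat template already prepended BOS, so skip special tokens here.
model_inputs = tokenizer(
    [text], return_tensors="pt", add_special_tokens=False
).to(model.device)

# Alternative: let apply_chat_template tokenize directly and return tensors.
model_inputs = tokenizer.apply_chat_template(
    messages_think,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
```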
Hello, I updated the README. I will close this issue.
nathanrchn changed discussion status to closed