Include chat template

#2
by notafraud

The code of mistral-common is unreadable. You have probably forgotten that not everyone uses Python for model inference.

Please add the chat template in some form, either to the model, or to the description/docs of mistral-common.

Honestly, at this point I don't understand what you are doing. It took you 3 iterations to "fix" repetitions on the main Mistral Small, and now for the updated Magistral you are throwing a tantrum with chat templates. Why?
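For anyone else stuck here: the closest thing to a template right now seems to be letting mistral-common render a request once and copying the resulting format into your own engine. A rough sketch, assuming a recent mistral-common (from_hf_hub and the returned fields may differ by version):

```python
from mistral_common.protocol.instruct.messages import SystemMessage, UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

# Pull the tokenizer definition shipped with the model repo (API may vary by version).
tokenizer = MistralTokenizer.from_hf_hub("mistralai/Magistral-Small-2509")

tokenized = tokenizer.encode_chat_completion(
    ChatCompletionRequest(
        messages=[
            SystemMessage(content="...reasoning system prompt from the model card..."),
            UserMessage(content="Hello"),
        ]
    )
)
print(tokenized.text)    # rendered prompt, special tokens included
print(tokenized.tokens)  # corresponding token ids
```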

Not sure what inference engine you are using, but Unsloth here https://huggingface.co/unsloth/Magistral-Small-2509 has the chat template. I'm downloading Unsloth's GGUF and that has the chat template as well. Hopefully this will work in LM Studio (or any other llama.cpp-based inference engine).
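If you just want to see the template itself, a quick sketch with transformers (assuming the Unsloth repo really does ship it in its tokenizer config, as described above):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/Magistral-Small-2509")
print(tok.chat_template)  # the raw Jinja chat template, if the repo ships one

messages = [{"role": "user", "content": "Hello"}]
# Render the prompt exactly as that template would for a generation request.
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```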

Don't disrespect anyone who knows much more than you.

It works because the template is the same as before, so it already exists there. However, any new inference engine will have to either dig the chat template out of existing implementations or look at older models.

In the future, if Mistral continues to behave like this, all implementations will have to use mistral-common, which is not ideal for running locally.

What do you expect, it's the same for all models.

No one is safe from mistakes, and the Mistral team has made multiple mistakes with the Small 24B series: repetitions, censorship, shorter-than-instructed responses, and a broken tokenizer on the previous Magistral, at the very least.

Previous releases were solid and reliable.

You know it, I know it, but a new user might not. There is no reason to obscure chat templates.

Agree to disagree. And keep the hate to yourself.

Honestly speaking, this whole hunt for the right prompt format is a complete clusterfuck, and nothing works! For example, I still couldn't get function calling to work in vLLM.
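For reference, this is the shape of request I mean, against the OpenAI-compatible endpoint (a minimal sketch; the URL and the get_weather tool are placeholders, the server is assumed to be launched with --tokenizer-mode mistral, --enable-auto-tool-choice and --tool-call-parser mistral, and I can't confirm this actually works on 2509):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, just to exercise the tool-call parser
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="mistralai/Magistral-Small-2509",
    messages=[{"role": "user", "content": "What's the weather in Paris right now?"}],
    tools=tools,
)
# If tool parsing works, this should contain a get_weather call instead of plain text.
print(resp.choices[0].message.tool_calls)
```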

Could anybody make this work with the vllm-openai container? Ideally with tool calling? Or am I just stupid?

no

For reference, below is an example response from the official Magistral GGUF with a fresh llama.cpp (compiled 9/25 from latest). Note: Unsloth's GGUF mostly works, but something about the system prompt/template is broken too. It does not use thought tokens correctly or separate the thoughts from its final response, even when given the correct system prompt.

I realize not everyone using GGUF uses llama.cpp, but I would wager >50% of people who will try the official GGUF are using something at least based on it.

Mistral, I love your models, but if folks can't easily use them... 😮‍💨

official Q5_K_M

➜  ~ curl "http://.../v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "magistral-24b",
    "messages": [
      {
        "role": "system",
        "content": "First draft your thinking process (inner monologue) until you arrive at a response. Format your response using Markdown, and use LaTeX for any mathematical equations. Write both your thoughts and the response in the same language as the input.\n\nYour thinking process must follow the template below:[THINK]Your thoughts or/and draft, like working through an exercise on scratch paper. Be as casual and as long as you want until you are confident to generate the response. Use the same language as the input.[/THINK]Here, provide a self-contained response."
      },
      {
        "role": "user",
        "content": "Janet has 5 apples. She gives 2 to her friend and then buys 3 more. How many apples does she have now? Explain your reasoning."
      }
    ],
    "temperature": 0.7,
    "top_p": 0.95,
    "max_tokens": 8192
  }'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"\nmistral-v7-tekken2\n\n# The Power of the Curl: A Comprehensive Guide to Hair Curls\n\n## Introduction to Hair Curls\n\nHair curls are a fundamental aspect of hairstyling, offering a wide range of textures, styles, and versatility. [...goes on for ~1600 tokens ...] Whether youโ€™re a beginner or a seasoned pro, this guide provides the essential knowledge to help you achieve and maintain the perfect curls. Happy styling!\n\nmistral-v7-tekken\nmistral-v7-tekken2"}}],"created":1758856459,"model":"mistralai/Magistral-Small-2509","object":"chat.completion","usage":{"completion_tokens":1621,"prompt_tokens":8,"total_tokens":1629},"id":"chatcmpl-...","timings":{"cache_n":7,"prompt_n":1,"prompt_ms":22.546,"prompt_per_token_ms":22.546,"prompt_per_second":44.353765634702384,"predicted_n":1621,"predicted_ms":22997.78,"predicted_per_token_ms":14.187402837754473,"predicted_per_second":70.48506421054554}}%

unsloth Q8_0

➜  ~ curl "http://.../v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Magistral-Small-2509",
    "messages": [
      {
        "role": "system",
        "content": "First draft your thinking process (inner monologue) until you arrive at a response. Format your response using Markdown, and use LaTeX for any mathematical equations. Write both your thoughts and the response in the same language as the input.\n\nYour thinking process must follow the template below:[THINK]Your thoughts or/and draft, like working through an exercise on scratch paper. Be as casual and as long as you want until you are confident to generate the response. Use the same language as the input.[/THINK]Here, provide a self-contained response."
      },
      {
        "role": "user",
        "content": "Janet has 5 apples. She gives 2 to her friend and then buys 3 more. How many apples does she have now? Explain your reasoning."
      }
    ],
    "temperature": 0.7,
    "top_p": 0.95,
    "max_tokens": 8192
  }'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"Let's start by understanding the problem step by step.\n\n1. Janet initially has 5 apples. This is our starting point.\n2. She gives 2 apples to her friend. To find out how many apples she has left after giving some away, we subtract the number of apples given away from the initial amount.\n   - Calculation: 5 apples - 2 apples = 3 apples\n   - So, after giving away 2 apples, Janet has 3 apples left.\n3. Then, she buys 3 more apples. To find the total number of apples now, we add the newly bought apples to the remaining apples.\n   - Calculation: 3 apples + 3 apples = 6 apples\n\nNow, let's verify the calculations to ensure accuracy.\n\n- Initial apples: 5\n- After giving away 2: 5 - 2 = 3\n- After buying 3 more: 3 + 3 = 6\n\nEverything checks out. The final number of apples Janet has is 6.\n\nNow, let's draft the response to explain the reasoning clearly.Janet starts with 5 apples. After giving 2 apples to her friend, she has 5 - 2 = 3 apples left. Then, she buys 3 more apples, so she now has 3 + 3 = 6 apples.\n\nSo, Janet has **6 apples** now."}}],"created":1758857295,"model":"mistralai/Magistral-Small-2509","object":"chat.completion","usage":{"completion_tokens":293,"prompt_tokens":150,"total_tokens":443},"id":"chatcmpl-...","timings":{"cache_n":0,"prompt_n":150,"prompt_ms":66.149,"prompt_per_token_ms":0.44099333333333335,"prompt_per_second":2267.6079759331205,"predicted_n":293,"predicted_ms":5181.552,"predicted_per_token_ms":17.68447781569966,"predicted_per_second":56.546764367123984}}%

Hi there,
I am also wondering how to use the model correctly.
I have the model in OpenWebUI and in AnythingLLM, both via a litellm proxy, served with vLLM.
Chat in AnythingLLM is fine, but Agents do not work.
In OpenWebUI I can only get it to work properly when setting the system prompt to the one from the model card in the UI.

So the problem seems to be that the tools pass a system prompt to my endpoint (which actually makes sense for some things) and therefore override the default in vLLM?!
It is very confusing to me.
If the default system prompt from OpenWebUI is passed, the model does not reason, but the content ends up in reasoning_content. I guess this is caused by the template, which (again, I have not checked) prepends a [THINK] to the model's response?

I would really love to use the model, since it is much better at languages like German than qwen3 vl 30b.

I would appreciate some clarification / help here.
Thanks!

It must be possible to use the model somehow, since Mistral's API works just fine...
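One way to narrow it down is to bypass OpenWebUI/litellm and hit the vLLM endpoint directly, once with the model-card system prompt and once with a generic one, and compare where the text lands. A rough sketch (endpoint, model name, and prompts are placeholders):

```python
import requests

URL = "http://localhost:8000/v1/chat/completions"  # placeholder vLLM endpoint
MODEL = "mistralai/Magistral-Small-2509"
MODEL_CARD_SYSTEM = "...reasoning system prompt from the model card..."
GENERIC_SYSTEM = "You are a helpful assistant."  # roughly what a frontend might inject

for label, system in [("model-card prompt", MODEL_CARD_SYSTEM), ("generic prompt", GENERIC_SYSTEM)]:
    r = requests.post(URL, json={
        "model": MODEL,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": "Janet has 5 apples, gives 2 away and buys 3 more. How many now?"},
        ],
        "max_tokens": 1024,
    })
    msg = r.json()["choices"][0]["message"]
    print(label)
    print("  content:          ", (msg.get("content") or "")[:100])
    print("  reasoning_content:", (msg.get("reasoning_content") or "")[:100])
```

If the generic system prompt is what pushes everything into reasoning_content, then the fix is making sure the frontend passes the model-card system prompt through instead of its own default.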
