How to get this running properly on vLLM
Can someone tell me how to run this so that thinking always works, even if a UI or user passes a custom system prompt to the OpenAI-compatible endpoint?
Please load the model using the following vLLM command:
```bash
vllm serve mistralai/Magistral-Small-2509 \
  --tokenizer_mode mistral --config_format mistral \
  --load_format mistral --tool-call-parser mistral \
  --enable-auto-tool-choice --limit-mm-per-prompt '{"image":10}' \
  --tensor-parallel-size 2
```
with the following system prompt:
```
First draft your thinking process (inner monologue) until you arrive at a response. Format your response using Markdown, and use LaTeX for any mathematical equations. Write both your thoughts and the response in the same language as the input.

Your thinking process must follow the template below:
[THINK]
Your thoughts or/and draft, like working through an exercise on scratch paper. Be as casual and as long as you want until you are confident to generate the response. Use the same language as the input.
[/THINK]
Here, provide a self-contained response.
```
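For reference, a minimal sketch of what enforcing that prompt can look like from the client side of the OpenAI-compatible endpoint; the base URL, API key, and model name are placeholders and need to match your deployment:

```python
# Minimal sketch (not an official snippet): always send the Magistral
# reasoning system prompt, regardless of what the UI or user would
# otherwise put in the system message.
# Base URL, API key, and model name are placeholders for your deployment.
from openai import OpenAI

# Paste the system prompt quoted above into this variable.
MAGISTRAL_SYSTEM_PROMPT = "..."

client = OpenAI(base_url="http://localhost:8000/v1", api_key="redacted")

response = client.chat.completions.create(
    model="mistralai/Magistral-Small-2509",
    messages=[
        {"role": "system", "content": MAGISTRAL_SYSTEM_PROMPT},
        {"role": "user", "content": "How many r's are in 'strawberry'?"},
    ],
)

print(response.choices[0].message.content)
```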
Please let me know if you have any further questions.
I am using an AWQ version and some other parameters, but I include everything from your start command.
I think I just found where the problems originate:
```
(APIServer pid=1) WARNING 10-14 10:26:31 [mistral.py:429] Failed to convert token b'[THINK]' to id, replacing with
```
This should most likely not happen, and explains the problems I have with everything ending up in the reasoning content...
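One way to sanity-check this (a diagnostic sketch, not a fix) is to see whether [THINK] and [/THINK] resolve to single special-token ids in the Hugging Face tokenizer shipped with the AWQ repo; note this is only a proxy, since with --tokenizer_mode mistral vLLM loads the Mistral tokenizer instead:

```python
# Diagnostic sketch: check whether [THINK] / [/THINK] map to single
# special-token ids. If convert_tokens_to_ids returns the unk id (or None),
# or encode splits the tag into several ids, the reasoning parser cannot
# detect the tags reliably.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("cpatonn/Magistral-Small-2509-AWQ-4bit")

for tag in ("[THINK]", "[/THINK]"):
    print(tag, tok.convert_tokens_to_ids(tag), tok.encode(tag, add_special_tokens=False))
```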
Dockerfile command:

```dockerfile
CMD ["/bin/bash", "-c", "vllm serve $MODEL_PATH --served-model-name $MODEL_NAME --swap-space $SWAP_SPACE --max-num-seqs $MAX_NUM_SEQS --max-model-len $CONTEXT_LENGTH --gpu-memory-utilization $GPU_MEMORY_UTILIZATION --tensor-parallel-size $TENSOR_PARALLEL_SIZE --api-key $API_KEY $EXTRA_VLLM_ARGS --host $HOST --port $PORT"]
```
And Docker Compose:
```yaml
environment:
  MODEL_PATH: "cpatonn/Magistral-Small-2509-AWQ-4bit"
  MODEL_NAME: "magistral-small-2509"
  CONTEXT_LENGTH: "90000"
  SWAP_SPACE: "24"
  MAX_NUM_SEQS: "2"
  GPU_MEMORY_UTILIZATION: "0.75"
  TENSOR_PARALLEL_SIZE: "2"
  HOST: "0.0.0.0"
  PORT: "8000"
  API_KEY: "redacted"
  HF_TOKEN: "redacted"
  # Magistral 1.2:
  EXTRA_VLLM_ARGS: "--enable-log-requests --trust-remote-code --config_format mistral --load_format mistral --tool-call-parser mistral --tokenizer_mode mistral --enable-auto-tool-choice --reasoning-parser mistral --limit-mm-per-prompt {\"image\":10}"
```
Update:
At some point I also got:

```
(APIServer pid=1) WARNING 10-14 11:14:47 [mistral.py:429] Failed to convert token b'[/THINK]' to id, replacing with
```
It seems to me the parser is failing. If I do not set a parser, I at least get a full response and the model responds with the thinking tags...
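For reference, a quick way to check where the text ends up (reasoning_content vs. content) is to query the endpoint directly; a sketch using the port, API key placeholder, and served model name from the compose file above:

```python
# Sketch: query the OpenAI-compatible endpoint and print where the text
# lands. With --reasoning-parser set, the thoughts should show up in
# reasoning_content and the answer in content; without a parser, everything
# (including the [THINK] tags) ends up in content.
# Port, API key placeholder, and model name are taken from the compose file above.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    headers={"Authorization": "Bearer redacted"},
    json={
        "model": "magistral-small-2509",
        "messages": [{"role": "user", "content": "What is 17 * 23?"}],
    },
    timeout=300,
)
message = resp.json()["choices"][0]["message"]
print("reasoning_content:", message.get("reasoning_content"))
print("content:", message.get("content"))
```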
Sadly I cannot test with the full model provided by Mistral, because I only have two 3090s.