How to get this running properly on vLLM
Can someone tell me how to run this so that thinking always works, even if a UI or user passes a custom system prompt to the OpenAI-compatible endpoint?
Please load the model using the following vLLM command:
```bash
vllm serve mistralai/Magistral-Small-2509 \
  --tokenizer_mode mistral --config_format mistral \
  --load_format mistral --tool-call-parser mistral \
  --enable-auto-tool-choice --limit-mm-per-prompt '{"image":10}' \
  --tensor-parallel-size 2
```
with the following system prompt:
```
First draft your thinking process (inner monologue) until you arrive at a response. Format your response using Markdown, and use LaTeX for any mathematical equations. Write both your thoughts and the response in the same language as the input.

Your thinking process must follow the template below:
[THINK]
Your thoughts or/and draft, like working through an exercise on scratch paper. Be as casual and as long as you want until you are confident to generate the response. Use the same language as the input.
[/THINK]
Here, provide a self-contained response.
```
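For reference, a minimal sketch of what enforcing that prompt can look like from the client side of the OpenAI-compatible endpoint; the base URL, API key, and model name are placeholders and need to match your deployment:

```python
# Minimal sketch (not an official snippet): always send the Magistral
# reasoning system prompt, regardless of what the UI or user would
# otherwise put in the system message.
# Base URL, API key, and model name are placeholders for your deployment.
from openai import OpenAI

# Paste the system prompt quoted above into this variable.
MAGISTRAL_SYSTEM_PROMPT = "..."

client = OpenAI(base_url="http://localhost:8000/v1", api_key="redacted")

response = client.chat.completions.create(
    model="mistralai/Magistral-Small-2509",
    messages=[
        {"role": "system", "content": MAGISTRAL_SYSTEM_PROMPT},
        {"role": "user", "content": "How many r's are in 'strawberry'?"},
    ],
)

print(response.choices[0].message.content)
```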
Please let me know if you have any further questions.
I am using an AWQ version and some other parameters, but I include everything from your start command.
I think I just found where the problems originate:
```
(APIServer pid=1) WARNING 10-14 10:26:31 [mistral.py:429] Failed to convert token b'[THINK]' to id, replacing with
```
This should most likely not happen, and explains the problems I have with everything ending up in the reasoning content...
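One way to sanity-check this (a diagnostic sketch, not a fix) is to see whether [THINK] and [/THINK] resolve to single special-token ids in the Hugging Face tokenizer shipped with the AWQ repo; note this is only a proxy, since with --tokenizer_mode mistral vLLM loads the Mistral tokenizer instead:

```python
# Diagnostic sketch: check whether [THINK] / [/THINK] map to single
# special-token ids. If convert_tokens_to_ids returns the unk id (or None),
# or encode splits the tag into several ids, the reasoning parser cannot
# detect the tags reliably.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("cpatonn/Magistral-Small-2509-AWQ-4bit")

for tag in ("[THINK]", "[/THINK]"):
    print(tag, tok.convert_tokens_to_ids(tag), tok.encode(tag, add_special_tokens=False))
```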
Dockerfile command:

```dockerfile
CMD ["/bin/bash", "-c", "vllm serve $MODEL_PATH --served-model-name $MODEL_NAME --swap-space $SWAP_SPACE --max-num-seqs $MAX_NUM_SEQS --max-model-len $CONTEXT_LENGTH --gpu-memory-utilization $GPU_MEMORY_UTILIZATION --tensor-parallel-size $TENSOR_PARALLEL_SIZE --api-key $API_KEY $EXTRA_VLLM_ARGS --host $HOST --port $PORT"]
```
And Docker Compose:
```yaml
environment:
  MODEL_PATH: "cpatonn/Magistral-Small-2509-AWQ-4bit"
  MODEL_NAME: "magistral-small-2509"
  CONTEXT_LENGTH: "90000"
  SWAP_SPACE: "24"
  MAX_NUM_SEQS: "2"
  GPU_MEMORY_UTILIZATION: "0.75"
  TENSOR_PARALLEL_SIZE: "2"
  HOST: "0.0.0.0"
  PORT: "8000"
  API_KEY: "redacted"
  HF_TOKEN: "redacted"
  # Magistral 1.2:
  EXTRA_VLLM_ARGS: "--enable-log-requests --trust-remote-code --config_format mistral --load_format mistral --tool-call-parser mistral --tokenizer_mode mistral --enable-auto-tool-choice --reasoning-parser mistral --limit-mm-per-prompt {\"image\":10}"
```
Update:
At some point I also got:

```
(APIServer pid=1) WARNING 10-14 11:14:47 [mistral.py:429] Failed to convert token b'[/THINK]' to id, replacing with
```
It seems to me the parser is failing. If I do not set a parser, I at least get a full response and the model responds with the thinking tags...
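For reference, a quick way to check where the text ends up (reasoning_content vs. content) is to query the endpoint directly; a sketch using the port, API key placeholder, and served model name from the compose file above:

```python
# Sketch: query the OpenAI-compatible endpoint and print where the text
# lands. With --reasoning-parser set, the thoughts should show up in
# reasoning_content and the answer in content; without a parser, everything
# (including the [THINK] tags) ends up in content.
# Port, API key placeholder, and model name are taken from the compose file above.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    headers={"Authorization": "Bearer redacted"},
    json={
        "model": "magistral-small-2509",
        "messages": [{"role": "user", "content": "What is 17 * 23?"}],
    },
    timeout=300,
)
message = resp.json()["choices"][0]["message"]
print("reasoning_content:", message.get("reasoning_content"))
print("content:", message.get("content"))
```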
Sadly I cannot test with the full model provided by Mistral, because I only have two 3090s.