How to run this model via vllm?

Opened by traphix


Try using the latest vllm main branch. The usual vllm serve command should work. Let me know!
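For example, a minimal sketch (the checkpoint id and flags mirror the working script later in this thread, not something verified here):

vllm serve shanjiaz/qwen3-80b-fp8-dynamic \
    --tensor-parallel-size 2 \
    --max-model-len 4096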

I cannot run it using vllm 0.10.2:

KeyError: 'layers.31.mlp.shared_expert.down_proj.weight'

Maybe this PR fixes the bug: #24960

I will try it later.

Thank you so much!

I used the latest vllm, and a new error comes up:

KeyError: 'layers.11.mlp.shared_expert_gate.weight_scale'

See #24960.

Huh, I'll spend some time looking into this today. Thanks for putting out the fix!

Hi @traphix, I'm on vllm version 0.10.2rc3.dev161+g087c6ffc9.d20250918.precompiled and the model runs correctly on 2 H100 machines for me. This is the short script I used:

if __name__ == '__main__':
    from vllm import LLM, SamplingParams

    prompts = [
        "The Swiss Alps are",
        "Brad Marchand is",
        "The Toronto Maple Leafs are"
    ]

    # Create a sampling params object (temperature sampling with top-p)
    sampling_params = SamplingParams(temperature=0.80, top_p=0.95, max_tokens=40, min_tokens=10)
    llm = LLM("shanjiaz/qwen3-80b-fp8-dynamic", tensor_parallel_size=2, max_model_len=4096)
    output = llm.generate(prompts, sampling_params)
    for out in output:
        print(out.outputs[0].text)

These are the results:
Processed prompts: 100%|██████████| 3/3 [00:45<00:00, 15.17s/it, est. speed input: 0.29 toks/s, output: 2.64 toks/s]
a mountain range in Switzerland, forming a major part of the larger Alpine mountain range that spans across several European countries. The highest peak in the Swiss Alps is Monte Rosa, which reaches an elevation of
a professional ice hockey player who has played for the Boston Bruins in the National Hockey League (NHL). He was selected in the 2006 NHL Entry Draft by the Bruins and has played
a professional ice hockey team based in Toronto, Ontario, competing in the National Hockey League (NHL) as a member of the Atlantic Division. The team has a rich history, including a 1
(Worker_TP0 pid=2533118) INFO 09-19 20:19:09 [multiproc_executor.py:558] Parent process exited, terminating worker
(Worker_TP1 pid=2533119) INFO 09-19 20:19:09 [multiproc_executor.py:558] Parent process exited, terminating worker

I tried on 4 x L40S GPUs, but it failed.

python3 -m vllm.entrypoints.openai.api_server \
    --served-model-name qwen3-next-80b-a3b-instruct \
    --model /data/model-cache/qwen3-80b-fp8-dynamic \
    --tensor-parallel-size 4 \
    --enable-expert-parallel

Can you share the error messages? Thanks!

Here are the error logs; there seems to be no useful info, just "raise e from None":

(EngineCore_DP0 pid=279) EngineCore failed to start.
(EngineCore_DP0 pid=279) Traceback (most recent call last):
(EngineCore_DP0 pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=279)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=279)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=279)     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 83, in __init__
(EngineCore_DP0 pid=279)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=279)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=279)     self._init_executor()
(EngineCore_DP0 pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 106, in _init_executor
(EngineCore_DP0 pid=279)     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=279)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 509, in wait_for_ready
(EngineCore_DP0 pid=279)     raise e from None
(EngineCore_DP0 pid=279) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore_DP0 pid=279) Process EngineCore_DP0:
(EngineCore_DP0 pid=279) Traceback (most recent call last):
(EngineCore_DP0 pid=279)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=279)     self.run()
(EngineCore_DP0 pid=279)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=279)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 712, in run_engine_core
(EngineCore_DP0 pid=279)     raise e
(EngineCore_DP0 pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=279)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=279)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=279)     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 83, in __init__
(EngineCore_DP0 pid=279)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=279)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=279)     self._init_executor()
(EngineCore_DP0 pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 106, in _init_executor
(EngineCore_DP0 pid=279)     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=279)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 509, in wait_for_ready
(EngineCore_DP0 pid=279)     raise e from None
(EngineCore_DP0 pid=279) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.

I think you have to scroll up a little to see the real error. Thanks! Do you mind also sharing which version of vllm you're using?
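If the root cause still doesn't show up, one option (a sketch using vllm's documented logging env var) is to rerun the same command with verbose logging so the worker's own traceback is printed:

VLLM_LOGGING_LEVEL=DEBUG \
python3 -m vllm.entrypoints.openai.api_server \
    --served-model-name qwen3-next-80b-a3b-instruct \
    --model /data/model-cache/qwen3-80b-fp8-dynamic \
    --tensor-parallel-size 4 \
    --enable-expert-parallel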

Dear Helen,

  1. It looks like the model.safetensors.index.json file doesn't contain any 'mtp.*' layers anymore (but the original model does). Does llmcompressor 0.8.0 now support exporting MTP layers (multi-token-prediction speculative decoding)?
  2. The recipe of this model ignores 're:.*self_attn.*', but qwen3_next_example.py (https://github.com/vllm-project/llm-compressor/pull/1886) doesn't show this layer exclusion. What is the best approach? (See the recipe sketch below.)

Thank you.
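For reference, a minimal sketch of an FP8-dynamic recipe with that self_attn exclusion, assuming llm-compressor's standard QuantizationModifier and oneshot API; the base model id and ignore list here are illustrative, not this checkpoint's actual recipe:

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Quantize all Linear layers to FP8 with dynamic per-token activation scales,
# skipping lm_head and every self-attention projection (illustrative ignore list).
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:.*self_attn.*"],
)

# One-shot application: quantizes the model and writes a compressed checkpoint.
oneshot(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",  # illustrative base model id
    recipe=recipe,
    output_dir="qwen3-80b-fp8-dynamic",
)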
