Failed to run vLLM batch inference
#9 by Junxiao-Zhao - opened
While both the Instruct and Thinking models work with vllm serve, they fail when run through batch inference:
python -m vllm.entrypoints.openai.run_batch \
-i ./qr_msgs4infer.jsonl \
-o ./qr_infer_results.jsonl \
--model ./Qwen3-Next-80B-A3B-Thinking \
--tensor-parallel-size 2 \
--max-model-len 32768 \
--max-num-batched-tokens 32768 \
--enable-prefix-caching \
--enable-chunked-prefill \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
which raised this error: TypeError: argument 'id': StreamInput must be either an integer or a list of integers
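For context, each line of qr_msgs4infer.jsonl follows the OpenAI Batch API request format that run_batch expects; a minimal line looks roughly like this (the custom_id and message content are placeholders):

{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "./Qwen3-Next-80B-A3B-Thinking", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 1024}}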
Env info
Package Version Editable project location
--------------------------------- ------------------------------- -------------------------
accelerate 1.10.1
aiohappyeyeballs 2.6.1
aiohttp 3.8.6
aiohttp-cors 0.7.0
aiosignal 1.4.0
alabaster 0.7.16
annotated-types 0.7.0
antlr4-python3-runtime 4.9.3
anyio 4.10.0
apex 0.1
argon2-cffi 25.1.0
argon2-cffi-bindings 25.1.0
arrow 1.3.0
asn1crypto 1.5.1
astor 0.8.1
asttokens 3.0.0
async-lru 2.0.5
async-timeout 4.0.3
attrs 25.3.0
av 15.1.0
babel 2.17.0
beautifulsoup4 4.13.5
bidict 0.23.1
blake3 1.0.5
bleach 6.2.0
blessed 1.21.0
blinker 1.5
cachetools 5.5.2
causal-conv1d 1.5.2
cbor2 5.7.0
certifi 2025.7.14
cffi 1.17.1
chardet 5.1.0
charset-normalizer 3.4.2
click 8.2.1
cloudpickle 3.1.1
codetiming 1.4.0
colorful 0.5.7
comm 0.2.3
compressed-tensors 0.11.0
configparser 7.2.0
crccheck 1.3.1
crcmod 1.7
crypto 1.4.1
cryptography 43.0.3
cupy-cuda12x 13.6.0
datasets 4.0.0
dbus-python 1.3.2
debugpy 1.8.16
decorator 5.2.1
defusedxml 0.7.1
Deprecated 1.2.18
depyf 0.19.0
devscripts 2.23.4+deb12u2
dill 0.3.8
diskcache 5.6.3
distlib 0.4.0
distro 1.8.0
distro-info 1.5+deb12u1
dnspython 2.7.0
docker-pycreds 0.4.0
docutils 0.19
ecdsa 0.19.1
einops 0.8.1
email-validator 2.3.0
exceptiongroup 1.3.0
executing 2.2.1
fastapi 0.116.1
fastapi-cli 0.0.8
fastapi-cloud-cli 0.1.5
fastjsonschema 2.21.2
fastrlock 0.8.3
filelock 3.18.0
fla-core 0.3.2
flash-attn 2.7.4.post1
flash-linear-attention 0.3.2
flashinfer-python 0.3.1
fqdn 1.5.1
frozendict 2.4.6
frozenlist 1.3.3
fsspec 2024.12.0
gguf 0.17.1
gitdb 4.0.12
GitPython 3.1.45
google-api-core 2.25.1
google-auth 2.40.3
googleapis-common-protos 1.70.0
gpg 1.18.0
gpustat 1.0.0
grpcio 1.59.5
h11 0.16.0
hf-xet 1.1.9
httpcore 1.0.9
httplib2 0.20.4
httptools 0.6.4
httpx 0.28.1
huggingface-hub 0.34.4
hydra-core 1.3.2
idna 3.10
imagesize 1.4.1
importlib-metadata 6.7.0
iniconfig 2.1.0
interegular 0.3.3
iotop 0.6
ipaddress 1.0.23
ipykernel 6.30.1
ipython 9.5.0
ipython_pygments_lexers 1.1.1
ipywidgets 8.1.7
isoduration 20.11.0
jedi 0.19.2
Jinja2 3.1.6
jiter 0.10.0
json5 0.12.1
jsonpointer 3.0.0
jsonschema 4.25.1
jsonschema-specifications 2025.4.1
jupyter 1.1.1
jupyter_client 8.6.3
jupyter-console 6.6.3
jupyter_core 5.8.1
jupyter-events 0.12.0
jupyter-lsp 2.3.0
jupyter_server 2.17.0
jupyter_server_terminals 0.5.3
jupyterlab 4.4.7
jupyterlab_pygments 0.3.0
jupyterlab_server 2.27.3
jupyterlab_widgets 3.0.15
lazr.restfulclient 0.14.5
lazr.uri 1.0.6
linkify-it-py 2.0.3
llguidance 0.7.30
llvmlite 0.44.0
lm-format-enforcer 0.11.3
markdown-it-py 4.0.0
MarkupSafe 3.0.2
mathruler 0.1.0
matplotlib-inline 0.1.7
mdit-py-plugins 0.5.0
mdurl 0.1.2
megatron-core 0.13.0 /Megatron-LM
memray 1.18.0
mistral_common 1.8.4
mistune 3.1.4
mpmath 1.3.0
msgpack 1.0.8
msgspec 0.19.0
multidict 6.6.4
multiprocess 0.70.16
Naked 0.1.32
nbclient 0.10.2
nbconvert 7.16.6
nbformat 5.10.4
nest-asyncio 1.6.0
networkx 3.5
ninja 1.13.0
none 0.1.1
notebook 7.4.5
notebook_shim 0.2.4
numba 0.61.2
numpy 2.2.6
nvidia-cublas-cu12 12.8.4.1
nvidia-cuda-cupti-cu12 12.8.90
nvidia-cuda-nvrtc-cu12 12.8.93
nvidia-cuda-runtime-cu12 12.8.90
nvidia-cudnn-cu12 9.10.2.21
nvidia-cudnn-frontend 1.14.1
nvidia-cufft-cu12 11.3.3.83
nvidia-cufile-cu12 1.13.1.3
nvidia-curand-cu12 10.3.9.90
nvidia-cusolver-cu12 11.7.3.90
nvidia-cusparse-cu12 12.5.8.93
nvidia-cusparselt-cu12 0.7.1
nvidia-ml-py 13.580.82
nvidia-nccl-cu12 2.27.3
nvidia-nvjitlink-cu12 12.8.93
nvidia-nvtx-cu12 12.8.90
oauthlib 3.2.2
omegaconf 2.3.0
openai 1.107.1
openai-harmony 0.0.4
opencensus 0.11.4
opencensus-context 0.1.3
opencv-python-headless 4.12.0.88
orjson 3.11.3
outlines_core 0.2.11
overrides 7.7.0
packaging 24.2
pandas 2.3.2
pandocfilters 1.5.1
parso 0.8.5
partial-json-parser 0.2.1.1.post6
pathlib2 2.3.7.post1
pathtools 0.1.2
peft 0.17.1
pexpect 4.9.0
pillow 11.3.0
pip 25.2
platformdirs 4.4.0
pluggy 1.6.0
ply 3.11
prometheus_client 0.22.1
prometheus-fastapi-instrumentator 7.1.0
promise 2.3
prompt_toolkit 3.0.52
propcache 0.3.2
proto-plus 1.26.1
protobuf 3.20.3
psutil 7.0.0
ptyprocess 0.7.0
pure_eval 0.2.3
py 1.11.0
py-cpuinfo 9.0.0
py-spy 0.4.1
pyarrow 19.0.1
pyasn1 0.6.1
pyasn1_modules 0.4.2
pybase64 1.4.2
pybind11 3.0.1
pycountry 24.6.1
pycparser 2.22
pycryptodome 3.18.0
pycryptodomex 3.23.0
pydantic 2.11.7
pydantic_core 2.33.2
pydantic-extra-types 2.10.5
Pygments 2.19.2
PyGObject 3.42.2
PyJWT 2.10.1
pylatexenc 2.10
pynvml 13.0.1
pyOpenSSL 25.1.0
pyparsing 3.0.9
pyrsistent 0.20.0
pytest 6.2.5
python-apt 2.6.0
python-consul 1.1.0
python-dateutil 2.9.0.post0
python-debian 0.1.49
python-dotenv 1.1.1
python-engineio 4.12.2
python-etcd 0.4.5
python-jose 3.5.0
python-json-logger 3.3.0
python-magic 0.4.26
python-multipart 0.0.20
python-socketio 5.13.0
pytz 2025.2
pyvers 0.1.0
pyxdg 0.28
PyYAML 6.0.2
pyzmq 27.0.2
qwen-vl-utils 0.0.11
ray 2.49.1
referencing 0.36.2
regex 2025.8.29
requests 2.32.4
rfc3339-validator 0.1.4
rfc3986 2.0.0
rfc3986-validator 0.1.1
rfc3987-syntax 1.1.0
rich 14.1.0
rich-toolkit 0.15.0
rignore 0.6.4
rpds-py 0.27.1
rsa 4.9.1
safetensors 0.6.2
schedule 1.2.2
scipy 1.16.1
Send2Trash 1.8.3
sentencepiece 0.2.1
sentry-sdk 2.35.1
setproctitle 1.3.6
setuptools 65.7.0
shellescape 3.8.1
shellingham 1.5.4
shortuuid 1.0.13
simple-websocket 1.1.0
six 1.16.0
smart-open 6.4.0
smmap 5.0.2
sniffio 1.3.1
snowballstemmer 3.0.1
soundfile 0.13.1
soupsieve 2.8
soxr 0.5.0.post1
Sphinx 5.3.0
sphinxcontrib-applehelp 2.0.0
sphinxcontrib-devhelp 2.0.0
sphinxcontrib-htmlhelp 2.1.0
sphinxcontrib-jsmath 1.0.1
sphinxcontrib-qthelp 2.0.0
sphinxcontrib-serializinghtml 2.0.0
sphinxcontrib-websupport 2.0.0
stack-data 0.6.3
starlette 0.47.3
sympy 1.14.0
tabulate 0.9.0
tensordict 0.9.1
terminado 0.18.1
textual 5.3.0
thriftpy2 0.5.3
tiktoken 0.11.0
tinycss2 1.4.0
tokenizers 0.22.0
toml 0.10.2
torch 2.8.0
torchaudio 2.8.0
torchdata 0.11.0
torchvision 0.23.0
tornado 6.5.2
tox 3.28.0
tqdm 4.67.1
traitlets 5.14.3
transformer-engine 2.2.0+d0c452c
transformers 4.57.0.dev0
triton 3.4.0
typer 0.17.3
types-python-dateutil 2.9.0.20250822
typing_extensions 4.14.1
typing-inspection 0.4.1
tzdata 2025.2
uc-micro-py 1.0.3
unattended-upgrades 0.1
unidiff 0.7.3
uri-template 1.3.0
urllib3 1.26.20
uvicorn 0.29.0
uvloop 0.21.0
virtualenv 20.34.0
vllm 0.10.2rc3.dev7+g561a0baee.cu129
wadllib 1.3.6
watchdog 6.0.0
watchfiles 0.24.0
wcwidth 0.2.13
webcolors 24.11.1
webencodings 0.5.1
websocket-client 1.8.0
websockets 15.0.1
wheel 0.45.1
widgetsnbextension 4.0.14
wrapt 1.17.3
wsproto 1.2.0
xdg 5
xformers 0.0.32.post1
xgrammar 0.1.23
xxhash 3.5.0
yarl 1.20.1
zipp 3.23.0
Full log
File "/usr/local/lib/python3.11/dist-packages/vllm/v1/engine/async_llm.py", line 392, in generate
out = q.get_nowait() or await q.get()
^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/v1/engine/output_processor.py", line 59, in get
raise output
File "/usr/local/lib/python3.11/dist-packages/vllm/v1/engine/async_llm.py", line 462, in output_handler
processed_outputs = output_processor.process_outputs(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/v1/engine/output_processor.py", line 420, in process_outputs
stop_string = req_state.detokenizer.update(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/v1/engine/detokenizer.py", line 118, in update
self.output_text += self.decode_next(new_token_id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/v1/engine/detokenizer.py", line 218, in decode_next
token = self._protected_step(next_token_id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/v1/engine/detokenizer.py", line 240, in _protected_step
raise e
File "/usr/local/lib/python3.11/dist-packages/vllm/v1/engine/detokenizer.py", line 232, in _protected_step
token = self.stream.step(self.tokenizer, next_token_id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: argument 'id': StreamInput must be either an integer or a list of integers
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/usr/local/lib/python3.11/dist-packages/vllm/entrypoints/openai/run_batch.py", line 497, in <module>
asyncio.run(main(args))
File "/usr/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/entrypoints/openai/run_batch.py", line 480, in main
await run_batch(engine_client, vllm_config, args)
File "/usr/local/lib/python3.11/dist-packages/vllm/entrypoints/openai/run_batch.py", line 464, in run_batch
responses = await asyncio.gather(*response_futures)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/entrypoints/openai/run_batch.py", line 284, in run_request
response = await serving_engine_func(request.body)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 330, in create_chat_completion
return await self.chat_completion_full_generator(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 1157, in chat_completion_full_generator
async for res in result_generator:
File "/usr/local/lib/python3.11/dist-packages/vllm/v1/engine/async_llm.py", line 425, in generate
raise EngineGenerateError() from e
vllm.v1.engine.exceptions.EngineGenerateError
vLLM 0.10.2 has been released, try it! The dev version didn't work for me, but vLLM 0.10.2 works well with CUDA 12.8.
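Assuming a CUDA 12.8 environment, upgrading to the release build should be something like:

pip install -U vllm==0.10.2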