Generates nonsense when run with latest VLLM with Flashinfer 0.4
#35 opened by stev236
It seems there's a problem with vLLM or FlashInfer that causes the model to generate nonsense when using the FlashInfer backend (0.4 and up). It's unclear whether the problem is on the vLLM side or the FlashInfer side.
Switching the backend to FLASH_ATTN solves the problem, but unfortunately that backend doesn't support FP8 KV-cache quantization.
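For anyone who wants to try the same workaround, here's a minimal sketch of forcing the FLASH_ATTN backend via vLLM's `VLLM_ATTENTION_BACKEND` environment variable (the model path and sampling settings are placeholders; behavior may vary across vLLM versions):

```python
import os

# Force the flash-attn backend instead of FlashInfer (workaround for the
# nonsense-output issue described above). Must be set before importing vllm.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

from vllm import LLM, SamplingParams

# Placeholder model path; note that with this backend the FP8 KV cache
# (kv_cache_dtype="fp8") apparently isn't available, per the issue above.
llm = LLM(model="path/to/model")
outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```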
See https://github.com/vllm-project/vllm/issues/26936
Has anybody else noticed this?