Generates nonsense when run on the latest vLLM with FlashInfer 0.4

#35
by stev236 - opened

It seems there's a problem with vLLM or FlashInfer that causes the model to generate nonsense when the FlashInfer backend (0.4 and up) is used. It's unclear whether the problem is on the vLLM side or the FlashInfer side.

Switching the backend to FLASH_ATTN solves the problem, but unfortunately that backend doesn't support FP8 KV-cache quantization.
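For reference, a rough sketch of the workaround; the model name, prompt, and sampling settings below are just placeholders, but `VLLM_ATTENTION_BACKEND` and `kv_cache_dtype` are the two knobs in question:

```python
import os

# Force the FlashAttention backend instead of FlashInfer (the workaround).
# With FLASHINFER selected here, the same setup produces garbage output.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

from vllm import LLM, SamplingParams

llm = LLM(
    model="your-model-here",     # placeholder model name
    # kv_cache_dtype="fp8",      # FP8 KV cache works with FlashInfer,
                                 # but not with FLASH_ATTN in my setup
)

out = llm.generate(
    ["Hello, world"],
    SamplingParams(max_tokens=32, temperature=0.0),
)
print(out[0].outputs[0].text)
```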

See https://github.com/vllm-project/vllm/issues/26936

Has anybody else noticed this?
