Generates nonsense when run on the latest vLLM with FlashInfer 0.4

#35
by stev236 - opened

It seems there's a problem with vLLM or FlashInfer that causes the model to generate nonsense when the FlashInfer backend (0.4 and up) is used. It's unclear whether the problem is on the vLLM side or the FlashInfer side.

Switching the backend to FLASH_ATTN solves the problem, but unfortunately that backend doesn't support FP8 KV-cache quantization.
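For reference, a rough sketch of the workaround; the model name, prompt, and sampling settings below are just placeholders, but `VLLM_ATTENTION_BACKEND` and `kv_cache_dtype` are the two knobs in question:

```python
import os

# Force the FlashAttention backend instead of FlashInfer (the workaround).
# With FLASHINFER selected here, the same setup produces garbage output.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

from vllm import LLM, SamplingParams

llm = LLM(
    model="your-model-here",     # placeholder model name
    # kv_cache_dtype="fp8",      # FP8 KV cache works with FlashInfer,
                                 # but not with FLASH_ATTN in my setup
)

out = llm.generate(
    ["Hello, world"],
    SamplingParams(max_tokens=32, temperature=0.0),
)
print(out[0].outputs[0].text)
```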

See https://github.com/vllm-project/vllm/issues/26936

Has anybody else noticed this?
