comfyui

The sampling acceleration from nvfp4 quantization in Krea2 is not significant.

#6
by Aca233 - opened

[Screenshot: QQ20260625-164559] FP8
[Screenshot: QQ20260625-164632] NVFP4

RTX5060 64GB DDR4

Same here. RTX5060Ti 16G, 96G RAM

Me too. nvfp4 is even slower than mxfp8. RTX5080, 96G RAM

Comfy Org org

I possibly uploaded the wrong version, one that doesn't allow the fast nvfp4 matmuls. It's re-uploaded now. For me it's ~15% faster than fp8 on a 5090. Not all layers could use nvfp4 matmuls because the quality loss was rather bad, so it's not going to be that much faster, but it should definitely not be slower.
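To illustrate why keeping some layers out of nvfp4 caps the overall gain, here is a minimal sketch using Amdahl's law. The function name, the fractions, and the 2x per-layer speedup are all hypothetical placeholders for illustration. They are not measurements from this model or from any GPU.

```python
def overall_speedup(fraction_accelerated: float, layer_speedup: float) -> float:
    """Amdahl's law: speedup of a whole step when only part of it gets faster.

    fraction_accelerated: share of the original step time spent in layers
        that can use the faster matmuls (hypothetical input, 0.0 to 1.0).
    layer_speedup: how many times faster those layers run (hypothetical input).
    """
    if not 0.0 <= fraction_accelerated <= 1.0:
        raise ValueError("fraction_accelerated must be between 0 and 1")
    if layer_speedup <= 0:
        raise ValueError("layer_speedup must be positive")
    remaining = 1.0 - fraction_accelerated          # time that is unchanged
    accelerated = fraction_accelerated / layer_speedup  # time after speedup
    return 1.0 / (remaining + accelerated)


# Hypothetical values only: assume the accelerated layers run 2x faster.
for fraction in (0.25, 0.5, 0.75):
    gain = overall_speedup(fraction, 2.0)
    print(f"{fraction:.0%} of step time accelerated -> {gain:.2f}x overall")
```

Under these assumed numbers, even a 2x gain on a quarter of the step time yields only about a 1.14x overall speedup, which is the same order as the ~15% figure reported above.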

Thanks for the update! I just used the new nvfp4. Generating that 3840x2160 image went from 20s per step down to 15s per step.
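For reference, the per-step times reported above can be converted into a relative saving and a throughput factor. This is just arithmetic on the two quoted numbers (20s and 15s per step). The function name is made up for this example.

```python
def step_time_change(before_s: float, after_s: float) -> tuple[float, float]:
    """Return (fraction of time saved per step, throughput speedup factor)."""
    if before_s <= 0 or after_s <= 0:
        raise ValueError("step times must be positive")
    time_saved = (before_s - after_s) / before_s  # share of time removed
    speedup = before_s / after_s                  # steps per second ratio
    return time_saved, speedup


# Numbers taken from the post: 20s per step before, 15s per step after.
saved, speedup = step_time_change(20.0, 15.0)
print(f"{saved:.0%} less time per step, {speedup:.2f}x throughput")
# prints "25% less time per step, 1.33x throughput"
```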

Could we get a raw version too? Thanks.

An nvfp4 raw version for what? The raw checkpoint in bf16 is best for LoRA training, not for image generation, because its outputs are very bad quality. nvfp4 weights of the raw model would be completely useless.

Raw with the turbo LoRA at 0.6 strength and 12 steps looks better than the turbo checkpoint.
