Can we do this with fewer active experts?

#3
by ccocks-deca

https://huggingface.co/inclusionAI/Ling-1T/blob/main/config.json#L22

Can we set this to 4? How bad would performance be? Kimi K2 only uses 32B active.
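
If anyone wants to experiment without editing the file on the Hub, here's a minimal sketch of overriding the value at load time with transformers, assuming the field at config.json#L22 is `num_experts_per_tok` (double-check the linked line; the field name and the value 4 are assumptions here). Quality will likely drop, since the router was trained to combine the full top-k set of experts:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Assumption: the setting in question is `num_experts_per_tok`. Lowering
# top-k means fewer routed experts per token, so fewer active params,
# most likely at a quality cost.
config = AutoConfig.from_pretrained("inclusionAI/Ling-1T", trust_remote_code=True)
config.num_experts_per_tok = 4  # down from the shipped value
model = AutoModelForCausalLM.from_pretrained(
    "inclusionAI/Ling-1T", config=config, trust_remote_code=True
)
```

Running perplexity on a held-out set before and after would show how bad the hit actually is.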

Despite being A50B, some early users of my quant https://huggingface.co/ubergarm2/Ling-1T-GGUF/discussions/3 and I are surprised to find it feels almost as fast as, say, A37B DeepSeek.
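
A back-of-envelope sketch of why that's plausible: decode on a memory-bound rig runs at roughly bandwidth / (active params * bytes per weight), so a lower-bpw A50B can end up reading about as many bytes per token as a higher-bpw A37B. The bandwidth and bpw figures below are illustrative assumptions, not the actual quant recipes:

```python
# Back-of-envelope only: token generation on a memory-bound setup runs at
# roughly bandwidth / (active_params * bytes_per_weight). All numbers here
# (bandwidth, bits per weight) are illustrative assumptions, not measurements.
def est_toks_per_sec(active_params_b: float, bits_per_weight: float,
                     bandwidth_gb_s: float = 400.0) -> float:
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

print(f"Ling-1T  A50B @ ~3.5 bpw: {est_toks_per_sec(50, 3.5):.1f} tok/s")
print(f"DeepSeek A37B @ ~4.5 bpw: {est_toks_per_sec(37, 4.5):.1f} tok/s")
```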

The quants I released shrank the GPU-specific tensors from ~20GiB down to ~15GiB to help keep the size of the active parameters down and speed things up a bit. I'd recommend ik_llama.cpp, especially for hybrid CPU+GPU or CPU-only inference.
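
For reference, a hedged sketch of what a hybrid launch can look like: nominally offload all layers to GPU, then use --override-tensor to push the big routed-expert tensors (names containing "exps") back to CPU. The binary path, model filename, and context size are placeholders; check your ik_llama.cpp build's --help for the exact flags:

```python
import subprocess

# Hypothetical hybrid CPU+GPU launch of ik_llama.cpp's llama-server.
# -ngl, -ot (--override-tensor), and -c are real flags, but every value
# here is a placeholder; adjust paths and sizes for your setup.
subprocess.run([
    "./build/bin/llama-server",
    "-m", "Ling-1T-IQ4_KSS.gguf",  # placeholder quant filename
    "-ngl", "99",                  # nominally offload all layers to GPU...
    "-ot", "exps=CPU",             # ...then route expert tensors back to CPU
    "-c", "32768",                 # placeholder context size
])
```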
