Can we do this with fewer active experts?

#3
by ccocks-deca

https://huggingface.co/inclusionAI/Ling-1T/blob/main/config.json#L22

Can we set this to 4? How bad would performance be? Kimi K2 only uses 32B active.
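
If anyone wants to experiment without editing the file on the Hub, here's a minimal sketch of overriding the value at load time with transformers, assuming the field at config.json#L22 is `num_experts_per_tok` (double-check the linked line; the field name and the value 4 are assumptions here). Quality will likely drop, since the router was trained to combine the full top-k set of experts:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Assumption: the setting in question is `num_experts_per_tok`. Lowering
# top-k means fewer routed experts per token, so fewer active params,
# most likely at a quality cost.
config = AutoConfig.from_pretrained("inclusionAI/Ling-1T", trust_remote_code=True)
config.num_experts_per_tok = 4  # down from the shipped value
model = AutoModelForCausalLM.from_pretrained(
    "inclusionAI/Ling-1T", config=config, trust_remote_code=True
)
```

Running perplexity on a held-out set before and after would show how bad the hit actually is.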

Despite being A50B, some early users of my quant https://huggingface.co/ubergarm2/Ling-1T-GGUF/discussions/3 and I are surprised to find it feels almost as fast as, say, A37B DeepSeek.
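
A back-of-envelope sketch of why that's plausible: decode on a memory-bound rig runs at roughly bandwidth / (active params * bytes per weight), so a lower-bpw A50B can end up reading about as many bytes per token as a higher-bpw A37B. The bandwidth and bpw figures below are illustrative assumptions, not the actual quant recipes:

```python
# Back-of-envelope only: token generation on a memory-bound setup runs at
# roughly bandwidth / (active_params * bytes_per_weight). All numbers here
# (bandwidth, bits per weight) are illustrative assumptions, not measurements.
def est_toks_per_sec(active_params_b: float, bits_per_weight: float,
                     bandwidth_gb_s: float = 400.0) -> float:
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

print(f"Ling-1T  A50B @ ~3.5 bpw: {est_toks_per_sec(50, 3.5):.1f} tok/s")
print(f"DeepSeek A37B @ ~4.5 bpw: {est_toks_per_sec(37, 4.5):.1f} tok/s")
```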

The quants I released shrank the GPU-specific tensors from ~20GiB down to ~15GiB to help keep the size of the active parameters down and speed things up a bit. I'd recommend ik_llama.cpp, especially for hybrid CPU+GPU or CPU-only inference.
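
For reference, a hedged sketch of what a hybrid launch can look like: nominally offload all layers to GPU, then use --override-tensor to push the big routed-expert tensors (names containing "exps") back to CPU. The binary path, model filename, and context size are placeholders; check your ik_llama.cpp build's --help for the exact flags:

```python
import subprocess

# Hypothetical hybrid CPU+GPU launch of ik_llama.cpp's llama-server.
# -ngl, -ot (--override-tensor), and -c are real flags, but every value
# here is a placeholder; adjust paths and sizes for your setup.
subprocess.run([
    "./build/bin/llama-server",
    "-m", "Ling-1T-IQ4_KSS.gguf",  # placeholder quant filename
    "-ngl", "99",                  # nominally offload all layers to GPU...
    "-ot", "exps=CPU",             # ...then route expert tensors back to CPU
    "-c", "32768",                 # placeholder context size
])
```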
