Benchmarks :)

#1 opened by sousekd

FYI @anikifoss. EPYC 9355 + RTX 5090, ik_llama:

32K context (f16 KV cache)

```bash
# -mla 3 -fa -fmoe: MLA attention, flash attention, fused MoE
# -amb 512: cap the attention compute buffer at 512 MiB
# -ngl 999 -ot exps=CPU: all layers on the GPU, MoE expert tensors kept on the CPU
./llama-sweep-bench \
    --model "$MODEL_PATH" \
    --no-mmap \
    -mla 3 -fa -fmoe \
    -amb 512 -b 8192 -ub 8192 \
    -ctk f16 -ctv f16 -c 32768 \
    -ngl 999 -ot exps=CPU \
    --threads 16 \
    --threads-batch 28 \
    --warmup-batch
```

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|---:|---:|---:|---:|---:|
| 8192 | 2048 | 0 | 18.316 | 447.26 | 120.846 | 16.95 |
| 8192 | 2048 | 8192 | 21.819 | 375.45 | 125.240 | 16.35 |
| 8192 | 2048 | 16384 | 25.541 | 320.74 | 129.407 | 15.83 |
| 8192 | 2048 | 24576 | 29.230 | 280.26 | 133.137 | 15.38 |

64K context (f16 KV cache)

```bash
./llama-sweep-bench \
    --model "$MODEL_PATH" \
    --no-mmap \
    -mla 3 -fa -fmoe \
    -amb 512 -b 4096 -ub 4096 \
    -ctk f16 -ctv f16 -c 65536 \
    -ngl 999 -ot exps=CPU \
    --threads 16 \
    --threads-batch 28 \
    --warmup-batch
```

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|---:|---:|---:|---:|---:|
| 4096 | 1024 | 0 | 13.439 | 304.77 | 60.534 | 16.92 |
| 4096 | 1024 | 4096 | 14.382 | 284.81 | 61.252 | 16.72 |
| 4096 | 1024 | 8192 | 15.330 | 267.19 | 62.561 | 16.37 |
| 4096 | 1024 | 12288 | 16.182 | 253.12 | 63.642 | 16.09 |
| 4096 | 1024 | 16384 | 17.538 | 233.55 | 64.447 | 15.89 |
| 4096 | 1024 | 20480 | 18.546 | 220.86 | 64.860 | 15.79 |
| 4096 | 1024 | 24576 | 19.425 | 210.86 | 66.561 | 15.38 |
| 4096 | 1024 | 28672 | 20.291 | 201.86 | 67.512 | 15.17 |
| 4096 | 1024 | 32768 | 22.175 | 184.71 | 68.685 | 14.91 |
| 4096 | 1024 | 36864 | 23.248 | 176.19 | 68.859 | 14.87 |
| 4096 | 1024 | 40960 | 24.295 | 168.59 | 69.575 | 14.72 |
| 4096 | 1024 | 45056 | 25.366 | 161.47 | 71.286 | 14.36 |
| 4096 | 1024 | 49152 | 26.525 | 154.42 | 71.967 | 14.23 |
| 4096 | 1024 | 53248 | 27.597 | 148.42 | 72.500 | 14.12 |
| 4096 | 1024 | 57344 | 28.685 | 142.79 | 73.670 | 13.90 |
| 4096 | 1024 | 61440 | 29.759 | 137.64 | 74.286 | 13.78 |
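
Side note (not from the thread): since sweep-bench emits its results as a markdown table, a quick awk pass can summarize how throughput decays from the empty-cache row to the last row. A minimal sketch, assuming the table was saved to a hypothetical `sweep.md` (e.g. via `./llama-sweep-bench ... | tee sweep.md`):

```bash
awk -F'|' '/^\|[ ]*[0-9]/ {
    pp = $6 + 0; tg = $8 + 0            # S_PP and S_TG columns as numbers
    if (!n++) { pp0 = pp; tg0 = tg }    # remember the empty-cache row
}
END {
    printf "PP: %.1f -> %.1f t/s (%.0f%% retained)\n", pp0, pp, 100 * pp / pp0
    printf "TG: %.1f -> %.1f t/s (%.0f%% retained)\n", tg0, tg, 100 * tg / tg0
}' sweep.md
```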

Both the `-op 27,0,28,0,30,0,31,0` and `-rtr` options improve TG a tiny bit but make PP worse.
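
For anyone wanting to try the same toggles, a hedged sketch of how they slot into the 32K invocation above (in ik_llama, `-rtr` enables run-time repacking and `-op` overrides the per-op offload policy; the values are copied verbatim from the comment, and the effect will be hardware-dependent):

```bash
./llama-sweep-bench \
    --model "$MODEL_PATH" \
    --no-mmap \
    -mla 3 -fa -fmoe \
    -amb 512 -b 8192 -ub 8192 \
    -ctk f16 -ctv f16 -c 32768 \
    -ngl 999 -ot exps=CPU \
    -rtr \
    -op 27,0,28,0,30,0,31,0 \
    --threads 16 \
    --threads-batch 28 \
    --warmup-batch
```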

Thanks, these look really fast!

For comparison: Q8_0, 32K context (f16 KV cache), same hardware, same command as for the first table above:

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|---:|---:|---:|---:|---:|
| 8192 | 2048 | 0 | 23.713 | 345.47 | 152.981 | 13.39 |
| 8192 | 2048 | 8192 | 26.992 | 303.49 | 157.688 | 12.99 |
| 8192 | 2048 | 16384 | 30.692 | 266.91 | 161.631 | 12.67 |
| 8192 | 2048 | 24576 | 34.462 | 237.71 | 165.494 | 12.38 |

And here is 64K context (f16 KV cache). To fit it in 32 GB of VRAM, the maximum achievable batch size is `-ub 2048`:

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|---:|---:|---:|---:|---:|
| 2048 | 512 | 0 | 16.634 | 123.12 | 36.310 | 14.10 |
| 2048 | 512 | 2048 | 16.815 | 121.79 | 36.578 | 14.00 |
| 2048 | 512 | 4096 | 17.134 | 119.53 | 36.510 | 14.02 |
| 2048 | 512 | 6144 | 17.416 | 117.60 | 39.028 | 13.12 |
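
The exact command for this run is not shown; presumably it is the 64K invocation above with the batch sizes dropped to fit VRAM, roughly like this (an unverified reconstruction, assuming everything else is unchanged):

```bash
./llama-sweep-bench \
    --model "$MODEL_PATH" \
    --no-mmap \
    -mla 3 -fa -fmoe \
    -amb 512 -b 2048 -ub 2048 \
    -ctk f16 -ctv f16 -c 65536 \
    -ngl 999 -ot exps=CPU \
    --threads 16 \
    --threads-batch 28 \
    --warmup-batch
```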

Thanks for sharing! I think I'll need 16 MI50s for MoE offloading to hit those numbers on my system 🤔
