Benchmarks :)
#1 opened by sousekd
FYI @anikifoss. EPYC 9355 + RTX 5090, ik_llama:
32K context (f16):

```bash
./llama-sweep-bench \
  --model "$MODEL_PATH" \
  --no-mmap \
  -mla 3 -fa -fmoe \
  -amb 512 -b 8192 -ub 8192 \
  -ctk f16 -ctv f16 -c 32768 \
  -ngl 999 -ot exps=CPU \
  --threads 16 \
  --threads-batch 28 \
  --warmup-batch
```
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 8192 | 2048 | 0 | 18.316 | 447.26 | 120.846 | 16.95 |
| 8192 | 2048 | 8192 | 21.819 | 375.45 | 125.240 | 16.35 |
| 8192 | 2048 | 16384 | 25.541 | 320.74 | 129.407 | 15.83 |
| 8192 | 2048 | 24576 | 29.230 | 280.26 | 133.137 | 15.38 |
64K context (f16):

```bash
./llama-sweep-bench \
  --model "$MODEL_PATH" \
  --no-mmap \
  -mla 3 -fa -fmoe \
  -amb 512 -b 4096 -ub 4096 \
  -ctk f16 -ctv f16 -c 65536 \
  -ngl 999 -ot exps=CPU \
  --threads 16 \
  --threads-batch 28 \
  --warmup-batch
```
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 13.439 | 304.77 | 60.534 | 16.92 |
| 4096 | 1024 | 4096 | 14.382 | 284.81 | 61.252 | 16.72 |
| 4096 | 1024 | 8192 | 15.330 | 267.19 | 62.561 | 16.37 |
| 4096 | 1024 | 12288 | 16.182 | 253.12 | 63.642 | 16.09 |
| 4096 | 1024 | 16384 | 17.538 | 233.55 | 64.447 | 15.89 |
| 4096 | 1024 | 20480 | 18.546 | 220.86 | 64.860 | 15.79 |
| 4096 | 1024 | 24576 | 19.425 | 210.86 | 66.561 | 15.38 |
| 4096 | 1024 | 28672 | 20.291 | 201.86 | 67.512 | 15.17 |
| 4096 | 1024 | 32768 | 22.175 | 184.71 | 68.685 | 14.91 |
| 4096 | 1024 | 36864 | 23.248 | 176.19 | 68.859 | 14.87 |
| 4096 | 1024 | 40960 | 24.295 | 168.59 | 69.575 | 14.72 |
| 4096 | 1024 | 45056 | 25.366 | 161.47 | 71.286 | 14.36 |
| 4096 | 1024 | 49152 | 26.525 | 154.42 | 71.967 | 14.23 |
| 4096 | 1024 | 53248 | 27.597 | 148.42 | 72.500 | 14.12 |
| 4096 | 1024 | 57344 | 28.685 | 142.79 | 73.670 | 13.90 |
| 4096 | 1024 | 61440 | 29.759 | 137.64 | 74.286 | 13.78 |
Both `-op 27,0,28,0,30,0,31,0` and `-rtr` improve TG a tiny bit but make PP worse.
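For reference, a minimal sketch of how those two options slot into the same 32K invocation; the placement is illustrative, and all other flags are unchanged from the command above:

```bash
# 32K run with the offload-policy and run-time-repack options added
# (illustrative placement; every other flag matches the first command).
./llama-sweep-bench \
  --model "$MODEL_PATH" \
  --no-mmap \
  -mla 3 -fa -fmoe \
  -amb 512 -b 8192 -ub 8192 \
  -ctk f16 -ctv f16 -c 32768 \
  -ngl 999 -ot exps=CPU \
  -op 27,0,28,0,30,0,31,0 \
  -rtr \
  --threads 16 \
  --threads-batch 28 \
  --warmup-batch
```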
Thanks, these look really fast!
For comparison: Q8_0, 32K context (f16), same hardware, same command as for the first table above:
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 8192 | 2048 | 0 | 23.713 | 345.47 | 152.981 | 13.39 |
| 8192 | 2048 | 8192 | 26.992 | 303.49 | 157.688 | 12.99 |
| 8192 | 2048 | 16384 | 30.692 | 266.91 | 161.631 | 12.67 |
| 8192 | 2048 | 24576 | 34.462 | 237.71 | 165.494 | 12.38 |
And here is 64K context (f16). To fit it in 32 GB of VRAM, the maximum achievable batch size is `-ub 2048` (see the command sketch after the table):
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 2048 | 512 | 0 | 16.634 | 123.12 | 36.310 | 14.10 |
| 2048 | 512 | 2048 | 16.815 | 121.79 | 36.578 | 14.00 |
| 2048 | 512 | 4096 | 17.134 | 119.53 | 36.510 | 14.02 |
| 2048 | 512 | 6144 | 17.416 | 117.60 | 39.028 | 13.12 |
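For completeness, a sketch of the command behind this last table, assuming the only change from the earlier 64K invocation is the reduced batch size:

```bash
# Q8_0, 64K context; batch size lowered to fit the KV cache and compute
# buffers in 32 GB VRAM (assumption: all other flags as in the 64K run above).
./llama-sweep-bench \
  --model "$MODEL_PATH" \
  --no-mmap \
  -mla 3 -fa -fmoe \
  -amb 512 -b 2048 -ub 2048 \
  -ctk f16 -ctv f16 -c 65536 \
  -ngl 999 -ot exps=CPU \
  --threads 16 \
  --threads-batch 28 \
  --warmup-batch
```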
Thanks for sharing! I think I'll need 16 MI50s for MoE offloading to hit those numbers on my system 🤔