Benchmarks :)
#1 opened by sousekd
FYI @anikifoss. EPYC 9355 + RTX 5090, ik_llama:
32K context (f16):

```bash
./llama-sweep-bench \
  --model "$MODEL_PATH" \
  --no-mmap \
  -mla 3 -fa -fmoe \
  -amb 512 -b 8192 -ub 8192 \
  -ctk f16 -ctv f16 -c 32768 \
  -ngl 999 -ot exps=CPU \
  --threads 16 \
  --threads-batch 28 \
  --warmup-batch
```
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 8192 | 2048 | 0 | 18.316 | 447.26 | 120.846 | 16.95 |
| 8192 | 2048 | 8192 | 21.819 | 375.45 | 125.240 | 16.35 |
| 8192 | 2048 | 16384 | 25.541 | 320.74 | 129.407 | 15.83 |
| 8192 | 2048 | 24576 | 29.230 | 280.26 | 133.137 | 15.38 |
64K context (f16):

```bash
./llama-sweep-bench \
  --model "$MODEL_PATH" \
  --no-mmap \
  -mla 3 -fa -fmoe \
  -amb 512 -b 4096 -ub 4096 \
  -ctk f16 -ctv f16 -c 65536 \
  -ngl 999 -ot exps=CPU \
  --threads 16 \
  --threads-batch 28 \
  --warmup-batch
```
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 13.439 | 304.77 | 60.534 | 16.92 |
| 4096 | 1024 | 4096 | 14.382 | 284.81 | 61.252 | 16.72 |
| 4096 | 1024 | 8192 | 15.330 | 267.19 | 62.561 | 16.37 |
| 4096 | 1024 | 12288 | 16.182 | 253.12 | 63.642 | 16.09 |
| 4096 | 1024 | 16384 | 17.538 | 233.55 | 64.447 | 15.89 |
| 4096 | 1024 | 20480 | 18.546 | 220.86 | 64.860 | 15.79 |
| 4096 | 1024 | 24576 | 19.425 | 210.86 | 66.561 | 15.38 |
| 4096 | 1024 | 28672 | 20.291 | 201.86 | 67.512 | 15.17 |
| 4096 | 1024 | 32768 | 22.175 | 184.71 | 68.685 | 14.91 |
| 4096 | 1024 | 36864 | 23.248 | 176.19 | 68.859 | 14.87 |
| 4096 | 1024 | 40960 | 24.295 | 168.59 | 69.575 | 14.72 |
| 4096 | 1024 | 45056 | 25.366 | 161.47 | 71.286 | 14.36 |
| 4096 | 1024 | 49152 | 26.525 | 154.42 | 71.967 | 14.23 |
| 4096 | 1024 | 53248 | 27.597 | 148.42 | 72.500 | 14.12 |
| 4096 | 1024 | 57344 | 28.685 | 142.79 | 73.670 | 13.90 |
| 4096 | 1024 | 61440 | 29.759 | 137.64 | 74.286 | 13.78 |
Both `-op 27,0,28,0,30,0,31,0` and `-rtr` improve TG a tiny bit but make PP worse.
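For reference, a minimal sketch of how those two options slot into the same 32K invocation; the placement is illustrative, and all other flags are unchanged from the command above:

```bash
# 32K run with the offload-policy and run-time-repack options added
# (illustrative placement; every other flag matches the first command).
./llama-sweep-bench \
  --model "$MODEL_PATH" \
  --no-mmap \
  -mla 3 -fa -fmoe \
  -amb 512 -b 8192 -ub 8192 \
  -ctk f16 -ctv f16 -c 32768 \
  -ngl 999 -ot exps=CPU \
  -op 27,0,28,0,30,0,31,0 \
  -rtr \
  --threads 16 \
  --threads-batch 28 \
  --warmup-batch
```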
Thanks, these look really fast!
For comparison: Q8_0, 32K context (f16), same hardware, same command as for the first table above:
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 8192 | 2048 | 0 | 23.713 | 345.47 | 152.981 | 13.39 |
| 8192 | 2048 | 8192 | 26.992 | 303.49 | 157.688 | 12.99 |
| 8192 | 2048 | 16384 | 30.692 | 266.91 | 161.631 | 12.67 |
| 8192 | 2048 | 24576 | 34.462 | 237.71 | 165.494 | 12.38 |
And here is 64K context (f16). To fit it in 32 GB of VRAM, the maximum achievable batch size is `-ub 2048` (see the command sketch after the table):
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 2048 | 512 | 0 | 16.634 | 123.12 | 36.310 | 14.10 |
| 2048 | 512 | 2048 | 16.815 | 121.79 | 36.578 | 14.00 |
| 2048 | 512 | 4096 | 17.134 | 119.53 | 36.510 | 14.02 |
| 2048 | 512 | 6144 | 17.416 | 117.60 | 39.028 | 13.12 |
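For completeness, a sketch of the command behind this last table, assuming the only change from the earlier 64K invocation is the reduced batch size:

```bash
# Q8_0, 64K context; batch size lowered to fit the KV cache and compute
# buffers in 32 GB VRAM (assumption: all other flags as in the 64K run above).
./llama-sweep-bench \
  --model "$MODEL_PATH" \
  --no-mmap \
  -mla 3 -fa -fmoe \
  -amb 512 -b 2048 -ub 2048 \
  -ctk f16 -ctv f16 -c 65536 \
  -ngl 999 -ot exps=CPU \
  --threads 16 \
  --threads-batch 28 \
  --warmup-batch
```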
Thanks for sharing! I think I'll need 16 MI50s for MoE offloading to hit those numbers on my system 🤔