Update README.md
README.md CHANGED
@@ -19,7 +19,7 @@ language:
-Calibrated with 30 samples of `mmlu_philosophy`, got eval accuracy of
+Calibrated with 30 samples of `mmlu_philosophy`, got an eval accuracy of 80.06, while gemma-3-27b-it-INT4 gets 77.17 and the bfloat16 baseline gets 79.42.
# Inference with vLLM
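The hunk only shows this section heading; as a minimal sketch (prompt and sampling values are placeholders, and the checkpoint id is the one from the tables below), offline inference with vLLM's Python API would look something like:

```python
# Minimal vLLM offline-inference sketch; assumes the quantized checkpoint
# loads like any other HF model id (prompt and sampling values are placeholders).
from vllm import LLM, SamplingParams

llm = LLM(model="pytorch/gemma-3-27b-it-AWQ-INT4")
params = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["What is the philosophy of language?"], params)
print(outputs[0].outputs[0].text)
```

The same checkpoint can also be exposed over an OpenAI-compatible endpoint with `vllm serve pytorch/gemma-3-27b-it-AWQ-INT4`.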
@@ -221,7 +221,7 @@ We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-h
| Benchmark   | google/gemma-3-27b-it | jerryzh168/gemma-3-27b-it-INT4 | pytorch/gemma-3-27b-it-AWQ-INT4 |
|-------------|-----------------------|--------------------------------|---------------------------------|
-| philosophy | 79.42                 | 77.17                          |
+| philosophy | 79.42                 | 77.17                          | 80.06                           |
Note: jerryzh168/gemma-3-27b-it-INT4 is the H100-optimized checkpoint for INT4.
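The `lm_eval` command in the next hunk header runs the full MMLU suite; a sketch of scoring just this subset from Python, assuming the harness exposes `lm_eval.simple_evaluate` and registers the task as `mmlu_philosophy`:

```python
# Sketch of scoring the philosophy subset with lm-evaluation-harness's
# Python API; the model id and batch size here are illustrative.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=pytorch/gemma-3-27b-it-AWQ-INT4",
    tasks=["mmlu_philosophy"],
    device="cuda:0",
    batch_size=8,
)
print(results["results"]["mmlu_philosophy"])
```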
@@ -255,7 +255,7 @@ lm_eval --model hf --model_args pretrained=$MODEL --tasks mmlu --device cuda:0 -
| Benchmark         | google/gemma-3-27b-it | jerryzh168/gemma-3-27b-it-INT4 | pytorch/gemma-3-27b-it-AWQ-INT4 |
|-------------------|-----------------------|--------------------------------|---------------------------------|
-| Peak Memory (GB) | 55.01                 | 17.21 (69% reduction)          |
+| Peak Memory (GB) | 55.01                 | 17.21 (69% reduction)          | 27.66 (50% reduction)           |
Note: jerryzh168/gemma-3-27b-it-INT4 is the H100-optimized checkpoint for INT4.
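The `print(f"Peak Memory Usage: ...")` line visible in the next hunk header suggests how these numbers are read out. A minimal sketch, assuming a transformers auto-class load and `torch.cuda.max_memory_reserved` as the reported statistic:

```python
# Sketch of one way to measure the peak-memory numbers above; the auto class
# and the choice of max_memory_reserved over max_memory_allocated are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pytorch/gemma-3-27b-it-AWQ-INT4"
torch.cuda.reset_peak_memory_stats()
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="cuda:0")
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Why is the sky blue?", return_tensors="pt").to("cuda:0")
model.generate(**inputs, max_new_tokens=128)
mem = torch.cuda.max_memory_reserved() / 1e9
print(f"Peak Memory Usage: {mem:.02f} GB")
```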
@@ -317,8 +317,8 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
| Benchmark (Latency)       | google/gemma-3-27b-it | jerryzh168/gemma-3-27b-it-INT4 | pytorch/gemma-3-27b-it-AWQ-INT4 |
|---------------------------|-----------------------|--------------------------------|---------------------------------|
-| latency (batch_size=1)   | 7.46s                 | 4.84 (1.54x speedup)           |
-| latency (batch_size=256) | 39.55s                | 33.30 (1.19x speedup)          |
+| latency (batch_size=1)   | 7.46s                 | 4.84s (1.54x speedup)          | 4.92s (1.52x speedup)           |
+| latency (batch_size=256) | 39.55s                | 33.30s (1.19x speedup)         | 28.37s (1.39x speedup)          |
Note: jerryzh168/gemma-3-27b-it-INT4 is the H100-optimized checkpoint for INT4.
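A minimal sketch of how per-batch latencies like these can be timed with vLLM; batch sizes mirror the table, while prompt and output lengths are placeholders (`ignore_eos` pins the decode length so runs are comparable):

```python
# Sketch of a batch-latency measurement with vLLM; prompt and output
# lengths are placeholders, not the settings behind the table above.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="pytorch/gemma-3-27b-it-AWQ-INT4")
params = SamplingParams(max_tokens=128, ignore_eos=True)  # fixed decode length

for batch_size in (1, 256):
    prompts = ["Why is the sky blue?"] * batch_size
    llm.generate(prompts, params)  # warmup pass
    start = time.perf_counter()
    llm.generate(prompts, params)
    print(f"batch_size={batch_size}: {time.perf_counter() - start:.2f}s")
```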