Update README.md
README.md CHANGED
@@ -19,7 +19,7 @@ language:
-Calibrated with 30 samples of `mmlu_philosophy`, got eval accuracy of
+Calibrated with 30 samples of `mmlu_philosophy`, got an eval accuracy of 80.06, while gemma-3-27b-it-INT4 gets 77.17 and the bfloat16 baseline gets 79.42.
# Inference with vLLM
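The hunk only shows this section heading; as a minimal sketch (prompt and sampling values are placeholders, and the checkpoint id is the one from the tables below), offline inference with vLLM's Python API would look something like:

```python
# Minimal vLLM offline-inference sketch; assumes the quantized checkpoint
# loads like any other HF model id (prompt and sampling values are placeholders).
from vllm import LLM, SamplingParams

llm = LLM(model="pytorch/gemma-3-27b-it-AWQ-INT4")
params = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["What is the philosophy of language?"], params)
print(outputs[0].outputs[0].text)
```

The same checkpoint can also be exposed over an OpenAI-compatible endpoint with `vllm serve pytorch/gemma-3-27b-it-AWQ-INT4`.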
@@ -221,7 +221,7 @@ We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-h
| Benchmark   | google/gemma-3-27b-it | jerryzh168/gemma-3-27b-it-INT4 | pytorch/gemma-3-27b-it-AWQ-INT4 |
|-------------|-----------------------|--------------------------------|---------------------------------|
-| philosophy | 79.42                 | 77.17                          |
+| philosophy | 79.42                 | 77.17                          | 80.06                           |
Note: jerryzh168/gemma-3-27b-it-INT4 is the H100-optimized checkpoint for INT4.
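The `lm_eval` command in the next hunk header runs the full MMLU suite; a sketch of scoring just this subset from Python, assuming the harness exposes `lm_eval.simple_evaluate` and registers the task as `mmlu_philosophy`:

```python
# Sketch of scoring the philosophy subset with lm-evaluation-harness's
# Python API; the model id and batch size here are illustrative.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=pytorch/gemma-3-27b-it-AWQ-INT4",
    tasks=["mmlu_philosophy"],
    device="cuda:0",
    batch_size=8,
)
print(results["results"]["mmlu_philosophy"])
```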
@@ -255,7 +255,7 @@ lm_eval --model hf --model_args pretrained=$MODEL --tasks mmlu --device cuda:0 -
| Benchmark         | google/gemma-3-27b-it | jerryzh168/gemma-3-27b-it-INT4 | pytorch/gemma-3-27b-it-AWQ-INT4 |
|-------------------|-----------------------|--------------------------------|---------------------------------|
-| Peak Memory (GB) | 55.01                 | 17.21 (69% reduction)          |
+| Peak Memory (GB) | 55.01                 | 17.21 (69% reduction)          | 27.66 (50% reduction)           |
Note: jerryzh168/gemma-3-27b-it-INT4 is the H100-optimized checkpoint for INT4.
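The `print(f"Peak Memory Usage: ...")` line visible in the next hunk header suggests how these numbers are read out. A minimal sketch, assuming a transformers auto-class load and `torch.cuda.max_memory_reserved` as the reported statistic:

```python
# Sketch of one way to measure the peak-memory numbers above; the auto class
# and the choice of max_memory_reserved over max_memory_allocated are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pytorch/gemma-3-27b-it-AWQ-INT4"
torch.cuda.reset_peak_memory_stats()
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="cuda:0")
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Why is the sky blue?", return_tensors="pt").to("cuda:0")
model.generate(**inputs, max_new_tokens=128)
mem = torch.cuda.max_memory_reserved() / 1e9
print(f"Peak Memory Usage: {mem:.02f} GB")
```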
@@ -317,8 +317,8 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
| Benchmark (Latency)       | google/gemma-3-27b-it | jerryzh168/gemma-3-27b-it-INT4 | pytorch/gemma-3-27b-it-AWQ-INT4 |
|---------------------------|-----------------------|--------------------------------|---------------------------------|
-| latency (batch_size=1)   | 7.46s                 | 4.84 (1.54x speedup)           |
-| latency (batch_size=256) | 39.55s                | 33.30 (1.19x speedup)          |
+| latency (batch_size=1)   | 7.46s                 | 4.84s (1.54x speedup)          | 4.92s (1.52x speedup)           |
+| latency (batch_size=256) | 39.55s                | 33.30s (1.19x speedup)         | 28.37s (1.39x speedup)          |
Note: jerryzh168/gemma-3-27b-it-INT4 is the H100-optimized checkpoint for INT4.
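A minimal sketch of how per-batch latencies like these can be timed with vLLM; batch sizes mirror the table, while prompt and output lengths are placeholders (`ignore_eos` pins the decode length so runs are comparable):

```python
# Sketch of a batch-latency measurement with vLLM; prompt and output
# lengths are placeholders, not the settings behind the table above.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="pytorch/gemma-3-27b-it-AWQ-INT4")
params = SamplingParams(max_tokens=128, ignore_eos=True)  # fixed decode length

for batch_size in (1, 256):
    prompts = ["Why is the sky blue?"] * batch_size
    llm.generate(prompts, params)  # warmup pass
    start = time.perf_counter()
    llm.generate(prompts, params)
    print(f"batch_size={batch_size}: {time.perf_counter() - start:.2f}s")
```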