---
license: apache-2.0
pipeline_tag: text-generation
tags:
- fp8
- quantized
- llm-compressor
- compressed-tensors
- red hat
base_model:
- meta-llama/Llama-4-Maverick-17B-128E-Instruct
---

# Llama-4-Maverick-17B-128E-Instruct-block-FP8

## Model Overview
- **Model Architecture:** Llama4ForConditionalGeneration
- **Input:** Text, Image
- **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Release Date:**
- **Version:** 1.0
- **Model Developers:** Red Hat

Quantized version of [meta-llama/Llama-4-Maverick-17B-128E-Instruct](https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct).

### Model Optimizations

This model was obtained by quantizing the weights and activations of [meta-llama/Llama-4-Maverick-17B-128E-Instruct](https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct) to the FP8 data type.
This optimization reduces the number of bits per parameter from 16 to 8, reducing disk size and GPU memory requirements by approximately 50%.
Only the weights and activations of the linear operators within the transformer blocks of the language model are quantized; the attention layers, routers, vision encoder, and multimodal projector are left at their original precision.
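For intuition, the following is a minimal sketch of how per-block FP8 (E4M3) weight quantization works: each block of the weight matrix gets one scale, chosen so the block's largest magnitude maps to the FP8 maximum. The 128×128 block size and divisibility assumption are illustrative; the actual FP8_BLOCK scheme is implemented by llm-compressor (see the Creation section below).

```python
import torch

# Minimal per-block FP8 (E4M3) weight quantization sketch.
# Assumes dimensions divisible by `block`; real kernels handle padding.
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

def quantize_fp8_block(weight: torch.Tensor, block: int = 128):
    rows, cols = weight.shape
    scales = torch.empty(rows // block, cols // block)
    qweight = torch.empty_like(weight, dtype=torch.float8_e4m3fn)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = weight[i:i + block, j:j + block]
            # One scale per block: map the block's max magnitude to FP8_MAX.
            scale = tile.abs().max().clamp(min=1e-12) / FP8_MAX
            scales[i // block, j // block] = scale
            qweight[i:i + block, j:j + block] = (tile / scale).to(torch.float8_e4m3fn)
    # Dequantize a block as qweight_tile.float() * scale.
    return qweight, scales
```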

## Deployment

### Use with vLLM

1. Initialize the vLLM server:
```
vllm serve RedHatAI/Llama-4-Maverick-17B-128E-Instruct-block-FP8 --tensor_parallel_size 8
```

2. Send requests to the server:

```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-block-FP8"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)
```
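The model can also be used for offline inference through vLLM's Python API. The sketch below is a minimal text-only example; the prompt and sampling parameters are illustrative.

```python
from vllm import LLM, SamplingParams

# Offline (serverless) inference; tensor_parallel_size=8 matches the
# serving command above.
llm = LLM(
    model="RedHatAI/Llama-4-Maverick-17B-128E-Instruct-block-FP8",
    tensor_parallel_size=8,
)
sampling = SamplingParams(temperature=0.6, max_tokens=128)

outputs = llm.generate(["Give a short introduction to FP8 quantization."], sampling)
print(outputs[0].outputs[0].text)
```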

## Creation

This model was quantized using the [llm-compressor](https://github.com/vllm-project/llm-compressor) library as shown below.

<details>
<summary>Creation details</summary>

```python
from transformers import AutoProcessor, AutoModelForImageTextToText

from llmcompressor import oneshot
from llmcompressor.modeling import replace_modules_for_calibration
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.utils import dispatch_for_generation

MODEL_ID = "meta-llama/Llama-4-Maverick-17B-128E-Instruct"

# Load model.
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = replace_modules_for_calibration(model)

# Configure the quantization algorithm and scheme.
# In this case, we:
#   * quantize the weights to fp8 with per-block quantization
#   * quantize the activations to fp8 with dynamic per-token quantization
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_BLOCK",
    ignore=[
        "re:.*lm_head",
        "re:.*self_attn",
        "re:.*router",
        "re:.*vision_model.*",
        "re:.*multi_modal_projector.*",
        "Llama4TextAttention",
    ],
)

# Apply quantization.
oneshot(model=model, recipe=recipe)
dispatch_for_generation(model)

# Save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-block"
model.save_pretrained(SAVE_DIR)
processor.save_pretrained(SAVE_DIR)
```
</details>
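After saving, the checkpoint's quantization metadata can be sanity-checked from its config. The snippet below is a sketch: the key names follow compressed-tensors config conventions, and the expected values are illustrative.

```python
import json
import os

# The directory name produced by the creation script above.
SAVE_DIR = "Llama-4-Maverick-17B-128E-Instruct-FP8-block"

with open(os.path.join(SAVE_DIR, "config.json")) as f:
    config = json.load(f)

qcfg = config.get("quantization_config", {})
print(qcfg.get("format"))         # expected to indicate a float-quantized format
print(qcfg.get("config_groups"))  # per-group weight/activation quantization args
```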

## Evaluation

The model was evaluated on the OpenLLM leaderboard tasks (versions 1 and 2), using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), and on the HumanEval and MBPP coding benchmarks, using [evalplus](https://github.com/evalplus/evalplus).
[vLLM](https://docs.vllm.ai/en/stable/) was used for all evaluations.

<details>
<summary>Evaluation details</summary>

**OpenLLM V1**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Maverick-17B-128E-Instruct-block-FP8",dtype=auto,add_bos_token=True,max_model_len=16384,tensor_parallel_size=8,gpu_memory_utilization=0.9,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks openllm \
  --write_out \
  --batch_size auto \
  --show_config
```

**OpenLLM V2**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Maverick-17B-128E-Instruct-block-FP8",dtype=auto,add_bos_token=False,max_model_len=16384,tensor_parallel_size=8,gpu_memory_utilization=0.7,disable_log_stats=True,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks leaderboard \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --write_out \
  --batch_size auto \
  --show_config
```

**Coding Benchmarks**

```
evalplus.evaluate --model "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-block-FP8" \
  --dataset "humaneval" \
  --backend vllm \
  --tp 8 \
  --greedy

evalplus.evaluate --model "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-block-FP8" \
  --dataset "mbpp" \
  --backend vllm \
  --tp 8 \
  --greedy
```

</details>

### Accuracy
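Recovery is the quantized model's score as a percentage of the unquantized baseline's score, i.e. recovery = 100 × quantized / baseline; for example, on GSM8K, 100 × 92.72 / 93.03 ≈ 99.67%.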
<table>
  <thead>
    <tr>
      <th>Category</th>
      <th>Metric</th>
      <th>meta-llama/Llama-4-Maverick-17B-128E-Instruct</th>
      <th>RedHatAI/Llama-4-Maverick-17B-128E-Instruct-block-FP8</th>
      <th>Recovery (%)</th>
    </tr>
  </thead>
  <tbody>
    <!-- OpenLLM Leaderboard V1 -->
    <tr>
      <td rowspan="7"><b>OpenLLM V1</b></td>
      <td>ARC-Challenge (Acc-Norm, 25-shot)</td>
      <td>73.38</td>
      <td>73.38</td>
      <td>100.00</td>
    </tr>
    <tr>
      <td>GSM8K (Strict-Match, 5-shot)</td>
      <td>93.03</td>
      <td>92.72</td>
      <td>99.67</td>
    </tr>
    <tr>
      <td>HellaSwag (Acc-Norm, 10-shot)</td>
      <td>87.39</td>
      <td>87.33</td>
      <td>99.93</td>
    </tr>
    <tr>
      <td>MMLU (Acc, 5-shot)</td>
      <td>86.03</td>
      <td>86.15</td>
      <td>100.13</td>
    </tr>
    <tr>
      <td>TruthfulQA (MC2, 0-shot)</td>
      <td>62.76</td>
      <td>62.90</td>
      <td>100.23</td>
    </tr>
    <tr>
      <td>Winogrande (Acc, 5-shot)</td>
      <td>79.56</td>
      <td>79.40</td>
      <td>99.80</td>
    </tr>
    <tr>
      <td><b>Average Score</b></td>
      <td><b>80.36</b></td>
      <td><b>80.31</b></td>
      <td><b>99.94</b></td>
    </tr>
    <!-- OpenLLM Leaderboard V2 -->
    <tr>
      <td rowspan="7"><b>OpenLLM V2</b></td>
      <td>IFEval (Inst Level Strict Acc, 0-shot)</td>
      <td>89.93</td>
      <td>90.89</td>
      <td>101.07</td>
    </tr>
    <tr>
      <td>BBH (Acc-Norm, 3-shot)</td>
      <td>70.53</td>
      <td>71.03</td>
      <td>100.71</td>
    </tr>
    <tr>
      <td>Math-Hard (Exact-Match, 4-shot)</td>
      <td>64.73</td>
      <td>65.26</td>
      <td>100.82</td>
    </tr>
    <tr>
      <td>GPQA (Acc-Norm, 0-shot)</td>
      <td>31.29</td>
      <td>30.54</td>
      <td>97.59</td>
    </tr>
    <tr>
      <td>MUSR (Acc-Norm, 0-shot)</td>
      <td>46.56</td>
      <td>46.03</td>
      <td>98.86</td>
    </tr>
    <tr>
      <td>MMLU-Pro (Acc, 5-shot)</td>
      <td>64.11</td>
      <td>63.95</td>
      <td>99.75</td>
    </tr>
    <tr>
      <td><b>Average Score</b></td>
      <td><b>61.19</b></td>
      <td><b>61.28</b></td>
      <td><b>100.15</b></td>
    </tr>
    <!-- Coding -->
    <tr>
      <td rowspan="4"><b>Coding</b></td>
      <td>HumanEval pass@1</td>
      <td>abc</td>
      <td>ijk</td>
      <td>xyz</td>
    </tr>
    <tr>
      <td>HumanEval+ pass@1</td>
      <td>abc</td>
      <td>ijk</td>
      <td>xyz</td>
    </tr>
    <tr>
      <td>MBPP pass@1</td>
      <td>abc</td>
      <td>ijk</td>
      <td>xyz</td>
    </tr>
    <tr>
      <td>MBPP+ pass@1</td>
      <td>abc</td>
      <td>ijk</td>
      <td>xyz</td>
    </tr>
  </tbody>
</table>