NeoChen1024 committed on
Commit 83d6111 · verified · 1 Parent(s): 846350e

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,531 @@
1
+ ---
2
+ license: gemma
3
+ library_name: transformers
4
+ pipeline_tag: image-text-to-text
5
+ extra_gated_heading: Access Gemma on Hugging Face
6
+ extra_gated_prompt: >-
7
+ To access Gemma on Hugging Face, you’re required to review and agree to
8
+ Google’s usage license. To do this, please ensure you’re logged in to Hugging
9
+ Face and click below. Requests are processed immediately.
10
+ extra_gated_button_content: Acknowledge license
11
+ base_model: google/gemma-3n-E4B-it
12
+ tags:
13
+ - automatic-speech-recognition
14
+ - automatic-speech-translation
15
+ - audio-text-to-text
16
+ - video-text-to-text
17
+ ---
18
+
19
+ # FP8 Dynamic Quantization of Gemma 3n E4B IT (Instruct)
20
+
21
+ > [!Note]
22
+ > This repository corresponds to the launch version of Gemma 3n E4B IT (Instruct), to be used with Hugging Face `transformers`,
23
+ > supporting text, audio, and vision (image and video) inputs.
24
+ >
25
+ > Gemma 3n models have multiple architecture innovations:
26
+ > * They are available in two sizes based on [effective parameters](https://ai.google.dev/gemma/docs/gemma-3n#parameters). While the raw parameter count of this model is 8B, the architecture design allows the model to be run with a memory footprint comparable to a traditional 4B model by offloading low-utilization matrices from the accelerator.
27
+ > * They use a MatFormer architecture that allows nesting sub-models within the E4B model. We provide one sub-model (an [E2B](https://huggingface.co/google/gemma-3n-E2B-it)), or you can access a spectrum of custom-sized models using the [Mix-and-Match method](https://goo.gle/gemma3n-matformer-lab).
28
+ >
29
+ > Learn more about these techniques in the [technical blog post](https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide)
30
+ > and the [Gemma documentation](https://ai.google.dev/gemma/docs/gemma-3n).
31
+
32
+ # Gemma 3n model card
33
+
34
+ **Model Page**: [Gemma 3n](https://ai.google.dev/gemma/docs/gemma-3n)
35
+
36
+ **Resources and Technical Documentation**:
37
+
38
+ - [Responsible Generative AI Toolkit](https://ai.google.dev/responsible)
39
+ - [Gemma on Kaggle](https://www.kaggle.com/models/google/gemma-3n)
40
+ - [Gemma on HuggingFace](https://huggingface.co/collections/google/gemma-3n-685065323f5984ef315c93f4)
41
+ - [Gemma on Vertex Model Garden](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma3n)
42
+
43
+ **Terms of Use**: [Terms](https://ai.google.dev/gemma/terms)\
44
+ **Authors**: Google DeepMind
45
+
46
+ ## Model Information
47
+
48
+ Summary description and brief definition of inputs and outputs.
49
+
50
+ ### Description
51
+
52
+ Gemma is a family of lightweight, state-of-the-art open models from Google,
53
+ built from the same research and technology used to create the Gemini models.
54
+ Gemma 3n models are designed for efficient execution on low-resource devices.
55
+ They are capable of multimodal input, handling text, image, video, and audio
56
+ input, and generating text outputs, with open weights for pre-trained and
57
+ instruction-tuned variants. These models were trained with data in over 140
58
+ spoken languages.
59
+
60
+ Gemma 3n models use selective parameter activation technology to reduce resource
61
+ requirements. This technique allows the models to operate at an effective size
62
+ of 2B and 4B parameters, which is lower than the total number of parameters they
63
+ contain. For more information on Gemma 3n's efficient parameter management
64
+ technology, see the
65
+ [Gemma 3n](https://ai.google.dev/gemma/docs/gemma-3n#parameters)
66
+ page.
67
+
68
+ ### Inputs and outputs
69
+
70
+ - **Input:**
71
+ - Text string, such as a question, a prompt, or a document to be
72
+ summarized
73
+ - Images, normalized to 256x256, 512x512, or 768x768 resolution
74
+ and encoded to 256 tokens each
75
+ - Audio data encoded to 6.25 tokens per second from a single channel
76
+ - Total input context of 32K tokens
77
+ - **Output:**
78
+ - Generated text in response to the input, such as an answer to a
79
+ question, analysis of image content, or a summary of a document
80
+ - Total output length up to 32K tokens, subtracting the request
81
+ input tokens
82
+
83
+ ### Usage
84
+
85
+ Below are some code snippets to help you get started quickly with running
86
+ the model. First, install the Transformers library. Gemma 3n is supported
87
+ starting from transformers 4.53.0.
88
+
89
+ ```sh
90
+ $ pip install -U transformers
91
+ ```
92
+
93
+ Then, copy the snippet from the section that is relevant for your use case.
94
+
95
+ #### Running with the `pipeline` API
96
+
97
+ You can initialize the model and processor for inference with `pipeline` as
98
+ follows.
99
+
100
+ ```python
101
+ from transformers import pipeline
102
+ import torch
103
+
104
+ pipe = pipeline(
105
+ "image-text-to-text",
106
+ model="google/gemma-3n-e4b-it",
107
+ device="cuda",
108
+ torch_dtype=torch.bfloat16,
109
+ )
110
+ ```
111
+
112
+ With instruction-tuned models, you need to use chat templates to process your
113
+ inputs first. Then, you can pass them to the pipeline.
114
+
115
+ ```python
116
+ messages = [
117
+ {
118
+ "role": "system",
119
+ "content": [{"type": "text", "text": "You are a helpful assistant."}]
120
+ },
121
+ {
122
+ "role": "user",
123
+ "content": [
124
+ {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
125
+ {"type": "text", "text": "What animal is on the candy?"}
126
+ ]
127
+ }
128
+ ]
129
+
130
+ output = pipe(text=messages, max_new_tokens=200)
131
+ print(output[0]["generated_text"][-1]["content"])
132
+ # Okay, let's take a look!
133
+ # Based on the image, the animal on the candy is a **turtle**.
134
+ # You can see the shell shape and the head and legs.
135
+ ```
136
+
137
+ #### Running the model on a single GPU
138
+
139
+ ```python
140
+ from transformers import AutoProcessor, Gemma3nForConditionalGeneration
141
+ from PIL import Image
142
+ import requests
143
+ import torch
144
+
145
+ model_id = "google/gemma-3n-e4b-it"
146
+
147
+ model = Gemma3nForConditionalGeneration.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16).eval()
148
+
149
+ processor = AutoProcessor.from_pretrained(model_id)
150
+
151
+ messages = [
152
+ {
153
+ "role": "system",
154
+ "content": [{"type": "text", "text": "You are a helpful assistant."}]
155
+ },
156
+ {
157
+ "role": "user",
158
+ "content": [
159
+ {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
160
+ {"type": "text", "text": "Describe this image in detail."}
161
+ ]
162
+ }
163
+ ]
164
+
165
+ inputs = processor.apply_chat_template(
166
+ messages,
167
+ add_generation_prompt=True,
168
+ tokenize=True,
169
+ return_dict=True,
170
+ return_tensors="pt",
171
+ ).to(model.device)
172
+
173
+ input_len = inputs["input_ids"].shape[-1]
174
+
175
+ with torch.inference_mode():
176
+ generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
177
+ generation = generation[0][input_len:]
178
+
179
+ decoded = processor.decode(generation, skip_special_tokens=True)
180
+ print(decoded)
181
+
182
+ # **Overall Impression:** The image is a close-up shot of a vibrant garden scene,
183
+ # focusing on a cluster of pink cosmos flowers and a busy bumblebee.
184
+ # It has a slightly soft, natural feel, likely captured in daylight.
185
+ ```
186
+
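Gemma 3n also accepts audio input (see the tags in the model card header), and the same chat-template pattern can be used for speech transcription. The snippet below is a minimal sketch that reuses the `model` and `processor` objects from the previous example; `sample.wav` is a placeholder path, and the exact content keys follow the chat template shipped in this repository, so details may vary slightly across `transformers` versions.

```python
# Minimal audio-to-text sketch. Assumes `model` and `processor` from the
# previous snippet are already loaded; "sample.wav" is a placeholder path.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "sample.wav"},
            {"type": "text", "text": "Transcribe this audio clip."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Strip the prompt tokens before decoding the transcription.
print(processor.decode(generation[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```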
187
+ ### Citation
188
+
189
+ ```
190
+ @article{gemma_3n_2025,
191
+ title={Gemma 3n},
192
+ url={https://ai.google.dev/gemma/docs/gemma-3n},
193
+ publisher={Google DeepMind},
194
+ author={Gemma Team},
195
+ year={2025}
196
+ }
197
+ ```
198
+
199
+ ## Model Data
200
+
201
+ Data used for model training and how the data was processed.
202
+
203
+ ### Training Dataset
204
+
205
+ These models were trained on a dataset that includes a wide variety of sources
206
+ totalling approximately 11 trillion tokens. The knowledge cutoff date for the
207
+ training data was June 2024. Here are the key components:
208
+
209
+ - **Web Documents**: A diverse collection of web text ensures the model
210
+ is exposed to a broad range of linguistic styles, topics, and vocabulary.
211
+ The training dataset includes content in over 140 languages.
212
+ - **Code**: Exposing the model to code helps it to learn the syntax and
213
+ patterns of programming languages, which improves its ability to generate
214
+ code and understand code-related questions.
215
+ - **Mathematics**: Training on mathematical text helps the model learn
216
+ logical reasoning, symbolic representation, and to address mathematical queries.
217
+ - **Images**: A wide range of images enables the model to perform image
218
+ analysis and visual data extraction tasks.
219
+ - **Audio**: A diverse set of sound samples enables the model to recognize
220
+ speech, transcribe text from recordings, and identify information in audio data.
221
+
222
+ The combination of these diverse data sources is crucial for training a
223
+ powerful multimodal model that can handle a wide variety of different tasks and
224
+ data formats.
225
+
226
+ ### Data Preprocessing
227
+
228
+ Here are the key data cleaning and filtering methods applied to the training
229
+ data:
230
+
231
+ - **CSAM Filtering**: Rigorous CSAM (Child Sexual Abuse Material)
232
+ filtering was applied at multiple stages in the data preparation process to
233
+ ensure the exclusion of harmful and illegal content.
234
+ - **Sensitive Data Filtering**: As part of making Gemma pre-trained models
235
+ safe and reliable, automated techniques were used to filter out certain
236
+ personal information and other sensitive data from training sets.
237
+ - **Additional methods**: Filtering based on content quality and safety in
238
+ line with
239
+ [our policies](https://ai.google/static/documents/ai-responsibility-update-published-february-2025.pdf).
240
+
241
+ ## Implementation Information
242
+
243
+ Details about the model internals.
244
+
245
+ ### Hardware
246
+
247
+ Gemma was trained using [Tensor Processing Unit
248
+ (TPU)](https://cloud.google.com/tpu/docs/intro-to-tpu) hardware (TPUv4p, TPUv5p
249
+ and TPUv5e). Training generative models requires significant computational
250
+ power. TPUs, designed specifically for matrix operations common in machine
251
+ learning, offer several advantages in this domain:
252
+
253
+ - **Performance**: TPUs are specifically designed to handle the massive
254
+ computations involved in training generative models. They can speed up
255
+ training considerably compared to CPUs.
256
+ - **Memory**: TPUs often come with large amounts of high-bandwidth memory,
257
+ allowing for the handling of large models and batch sizes during training.
258
+ This can lead to better model quality.
259
+ - **Scalability**: TPU Pods (large clusters of TPUs) provide a scalable
260
+ solution for handling the growing complexity of large foundation models.
261
+ You can distribute training across multiple TPU devices for faster and more
262
+ efficient processing.
263
+ - **Cost-effectiveness**: In many scenarios, TPUs can provide a more
264
+ cost-effective solution for training large models compared to CPU-based
265
+ infrastructure, especially when considering the time and resources saved
266
+ due to faster training.
267
+
268
+ These advantages are aligned with
269
+ [Google's commitments to operate sustainably](https://sustainability.google/operating-sustainably/).
270
+
271
+ ### Software
272
+
273
+ Training was done using [JAX](https://github.com/jax-ml/jax) and
274
+ [ML Pathways](https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/).
275
+ JAX allows researchers to take advantage of the latest generation of hardware,
276
+ including TPUs, for faster and more efficient training of large models. ML
277
+ Pathways is Google's latest effort to build artificially intelligent systems
278
+ capable of generalizing across multiple tasks. This is specially suitable for
279
+ foundation models, including large language models like these ones.
280
+
281
+ Together, JAX and ML Pathways are used as described in the
282
+ [paper about the Gemini family of models](https://goo.gle/gemma2report):
283
+ *"the 'single controller' programming model of Jax and Pathways allows a single
284
+ Python process to orchestrate the entire training run, dramatically simplifying
285
+ the development workflow."*
286
+
287
+ ## Evaluation
288
+
289
+ Model evaluation metrics and results.
290
+
291
+ ### Benchmark Results
292
+
293
+ These models were evaluated at full precision (float32) against a large
294
+ collection of different datasets and metrics to cover different aspects of
295
+ content generation. Evaluation results marked with **IT** are for
296
+ instruction-tuned models. Evaluation results marked with **PT** are for
297
+ pre-trained models.
298
+
299
+ #### Reasoning and factuality
300
+
301
+ | Benchmark | Metric | n-shot | E2B PT | E4B PT |
302
+ | ------------------------------ |----------------|----------|:--------:|:--------:|
303
+ | [HellaSwag][hellaswag] | Accuracy | 10-shot | 72.2 | 78.6 |
304
+ | [BoolQ][boolq] | Accuracy | 0-shot | 76.4 | 81.6 |
305
+ | [PIQA][piqa] | Accuracy | 0-shot | 78.9 | 81.0 |
306
+ | [SocialIQA][socialiqa] | Accuracy | 0-shot | 48.8 | 50.0 |
307
+ | [TriviaQA][triviaqa] | Accuracy | 5-shot | 60.8 | 70.2 |
308
+ | [Natural Questions][naturalq] | Accuracy | 5-shot | 15.5 | 20.9 |
309
+ | [ARC-c][arc] | Accuracy | 25-shot | 51.7 | 61.6 |
310
+ | [ARC-e][arc] | Accuracy | 0-shot | 75.8 | 81.6 |
311
+ | [WinoGrande][winogrande] | Accuracy | 5-shot | 66.8 | 71.7 |
312
+ | [BIG-Bench Hard][bbh] | Accuracy | few-shot | 44.3 | 52.9 |
313
+ | [DROP][drop] | Token F1 score | 1-shot | 53.9 | 60.8 |
314
+
315
+ [hellaswag]: https://arxiv.org/abs/1905.07830
316
+ [boolq]: https://arxiv.org/abs/1905.10044
317
+ [piqa]: https://arxiv.org/abs/1911.11641
318
+ [socialiqa]: https://arxiv.org/abs/1904.09728
319
+ [triviaqa]: https://arxiv.org/abs/1705.03551
320
+ [naturalq]: https://github.com/google-research-datasets/natural-questions
321
+ [arc]: https://arxiv.org/abs/1911.01547
322
+ [winogrande]: https://arxiv.org/abs/1907.10641
323
+ [bbh]: https://paperswithcode.com/dataset/bbh
324
+ [drop]: https://arxiv.org/abs/1903.00161
325
+
326
+ #### Multilingual
327
+
328
+ | Benchmark | Metric | n-shot | E2B IT | E4B IT |
329
+ | ------------------------------------|-------------------------|----------|:--------:|:--------:|
330
+ | [MGSM][mgsm] | Accuracy | 0-shot | 53.1 | 60.7 |
331
+ | [WMT24++][wmt24pp] (ChrF) | Character-level F-score | 0-shot | 42.7 | 50.1 |
332
+ | [Include][include] | Accuracy | 0-shot | 38.6 | 57.2 |
333
+ | [MMLU][mmlu] (ProX) | Accuracy | 0-shot | 8.1 | 19.9 |
334
+ | [OpenAI MMLU][openai-mmlu] | Accuracy | 0-shot | 22.3 | 35.6 |
335
+ | [Global-MMLU][global-mmlu] | Accuracy | 0-shot | 55.1 | 60.3 |
336
+ | [ECLeKTic][eclektic] | ECLeKTic score | 0-shot | 2.5 | 1.9 |
337
+
338
+ [mgsm]: https://arxiv.org/abs/2210.03057
339
+ [wmt24pp]: https://arxiv.org/abs/2502.12404v1
340
+ [include]: https://arxiv.org/abs/2411.19799
341
+ [mmlu]: https://arxiv.org/abs/2009.03300
342
+ [openai-mmlu]: https://huggingface.co/datasets/openai/MMMLU
343
+ [global-mmlu]: https://huggingface.co/datasets/CohereLabs/Global-MMLU
344
+ [eclektic]: https://arxiv.org/abs/2502.21228
345
+
346
+ #### STEM and code
347
+
348
+ | Benchmark | Metric | n-shot | E2B IT | E4B IT |
349
+ | ------------------------------------|--------------------------|----------|:--------:|:--------:|
350
+ | [GPQA][gpqa] Diamond | RelaxedAccuracy/accuracy | 0-shot | 24.8 | 23.7 |
351
+ | [LiveCodeBench][lcb] v5 | pass@1 | 0-shot | 18.6 | 25.7 |
352
+ | Codegolf v2.2 | pass@1 | 0-shot | 11.0 | 16.8 |
353
+ | [AIME 2025][aime-2025] | Accuracy | 0-shot | 6.7 | 11.6 |
354
+
355
+ [gpqa]: https://arxiv.org/abs/2311.12022
356
+ [lcb]: https://arxiv.org/abs/2403.07974
357
+ [aime-2025]: https://www.vals.ai/benchmarks/aime-2025-05-09
358
+
359
+ #### Additional benchmarks
360
+
361
+ | Benchmark | Metric | n-shot | E2B IT | E4B IT |
362
+ | ------------------------------------ |------------|----------|:--------:|:--------:|
363
+ | [MMLU][mmlu] | Accuracy | 0-shot | 60.1 | 64.9 |
364
+ | [MBPP][mbpp] | pass@1 | 3-shot | 56.6 | 63.6 |
365
+ | [HumanEval][humaneval] | pass@1 | 0-shot | 66.5 | 75.0 |
366
+ | [LiveCodeBench][lcb] | pass@1 | 0-shot | 13.2 | 13.2 |
367
+ | HiddenMath | Accuracy | 0-shot | 27.7 | 37.7 |
368
+ | [Global-MMLU-Lite][global-mmlu-lite] | Accuracy | 0-shot | 59.0 | 64.5 |
369
+ | [MMLU][mmlu] (Pro) | Accuracy | 0-shot | 40.5 | 50.6 |
370
+
371
+ [gpqa]: https://arxiv.org/abs/2311.12022
372
+ [mbpp]: https://arxiv.org/abs/2108.07732
373
+ [humaneval]: https://arxiv.org/abs/2107.03374
374
+ [lcb]: https://arxiv.org/abs/2403.07974
375
+ [global-mmlu-lite]: https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite
376
+
377
+ ## Ethics and Safety
378
+
379
+ Ethics and safety evaluation approach and results.
380
+
381
+ ### Evaluation Approach
382
+
383
+ Our evaluation methods include structured evaluations and internal red-teaming
384
+ testing of relevant content policies. Red-teaming was conducted by a number of
385
+ different teams, each with different goals and human evaluation metrics. These
386
+ models were evaluated against a number of different categories relevant to
387
+ ethics and safety, including:
388
+
389
+ - **Child Safety**: Evaluation of text-to-text and image-to-text prompts
390
+ covering child safety policies, including child sexual abuse and
391
+ exploitation.
392
+ - **Content Safety**: Evaluation of text-to-text and image-to-text prompts
393
+ covering safety policies including harassment, violence and gore, and hate
394
+ speech.
395
+ - **Representational Harms**: Evaluation of text-to-text and image-to-text
396
+ prompts covering safety policies including bias, stereotyping, and harmful
397
+ associations or inaccuracies.
398
+
399
+ In addition to development-level evaluations, we conduct "assurance
400
+ evaluations", which are our 'arms-length' internal evaluations for responsibility
401
+ governance decision making. They are conducted separately from the model
402
+ development team to inform decision making about release. High-level findings
403
+ are fed back to the model team, but prompt sets are held out to prevent
404
+ overfitting and preserve the results' ability to inform decision making. Notable
405
+ assurance evaluation results are reported to our Responsibility & Safety Council
406
+ as part of release review.
407
+
408
+ ### Evaluation Results
409
+
410
+ For all areas of safety testing, we saw safe levels of performance across the
411
+ categories of child safety, content safety, and representational harms relative
412
+ to previous Gemma models. All testing was conducted without safety filters to
413
+ evaluate the model capabilities and behaviors. For text-to-text, image-to-text,
414
+ and audio-to-text, and across all model sizes, the model produced minimal policy
415
+ violations, and showed significant improvements over previous Gemma models'
416
+ performance with respect to high-severity violations. A limitation of our
417
+ evaluations was that they included primarily English-language prompts.
418
+
419
+ ## Usage and Limitations
420
+
421
+ These models have certain limitations that users should be aware of.
422
+
423
+ ### Intended Usage
424
+
425
+ Open generative models have a wide range of applications across various
426
+ industries and domains. The following list of potential uses is not
427
+ comprehensive. The purpose of this list is to provide contextual information
428
+ about the possible use-cases that the model creators considered as part of model
429
+ training and development.
430
+
431
+ - Content Creation and Communication
432
+ - **Text Generation**: Generate creative text formats such as
433
+ poems, scripts, code, marketing copy, and email drafts.
434
+ - **Chatbots and Conversational AI**: Power conversational
435
+ interfaces for customer service, virtual assistants, or interactive
436
+ applications.
437
+ - **Text Summarization**: Generate concise summaries of a text
438
+ corpus, research papers, or reports.
439
+ - **Image Data Extraction**: Extract, interpret, and summarize
440
+ visual data for text communications.
441
+ - **Audio Data Extraction**: Transcribe spoken language, translate speech
442
+ to text in other languages, and analyze sound-based data.
443
+ - Research and Education
444
+ - **Natural Language Processing (NLP) and Generative Model
445
+ Research**: These models can serve as a foundation for researchers to
446
+ experiment with generative models and NLP techniques, develop
447
+ algorithms, and contribute to the advancement of the field.
448
+ - **Language Learning Tools**: Support interactive language
449
+ learning experiences, aiding in grammar correction or providing writing
450
+ practice.
451
+ - **Knowledge Exploration**: Assist researchers in exploring large
452
+ bodies of data by generating summaries or answering questions about
453
+ specific topics.
454
+
455
+ ### Limitations
456
+
457
+ - Training Data
458
+ - The quality and diversity of the training data significantly
459
+ influence the model's capabilities. Biases or gaps in the training data
460
+ can lead to limitations in the model's responses.
461
+ - The scope of the training dataset determines the subject areas
462
+ the model can handle effectively.
463
+ - Context and Task Complexity
464
+ - Models are better at tasks that can be framed with clear
465
+ prompts and instructions. Open-ended or highly complex tasks might be
466
+ challenging.
467
+ - A model's performance can be influenced by the amount of context
468
+ provided (longer context generally leads to better outputs, up to a
469
+ certain point).
470
+ - Language Ambiguity and Nuance
471
+ - Natural language is inherently complex. Models might struggle
472
+ to grasp subtle nuances, sarcasm, or figurative language.
473
+ - Factual Accuracy
474
+ - Models generate responses based on information they learned
475
+ from their training datasets, but they are not knowledge bases. They
476
+ may generate incorrect or outdated factual statements.
477
+ - Common Sense
478
+ - Models rely on statistical patterns in language. They might
479
+ lack the ability to apply common sense reasoning in certain situations.
480
+
481
+ ### Ethical Considerations and Risks
482
+
483
+ The development of generative models raises several ethical concerns. In
484
+ creating an open model, we have carefully considered the following:
485
+
486
+ - Bias and Fairness
487
+ - Generative models trained on large-scale, real-world text and image data
488
+ can reflect socio-cultural biases embedded in the training material.
489
+ These models underwent careful scrutiny, with input data pre-processing
490
+ described and posterior evaluations reported in this card.
491
+ - Misinformation and Misuse
492
+ - Generative models can be misused to generate text that is
493
+ false, misleading, or harmful.
494
+ - Guidelines are provided for responsible use with the model, see the
495
+ [Responsible Generative AI Toolkit](https://ai.google.dev/responsible).
496
+ - Transparency and Accountability
497
+ - This model card summarizes details on the models' architecture,
498
+ capabilities, limitations, and evaluation processes.
499
+ - A responsibly developed open model offers the opportunity to
500
+ share innovation by making generative model technology accessible to
501
+ developers and researchers across the AI ecosystem.
502
+
503
+ Risks identified and mitigations:
504
+
505
+ - **Perpetuation of biases**: Developers are encouraged to perform continuous monitoring
506
+ (using evaluation metrics, human review) and to explore de-biasing
507
+ techniques during model training, fine-tuning, and other use cases.
508
+ - **Generation of harmful content**: Mechanisms and guidelines for content
509
+ safety are essential. Developers are encouraged to exercise caution and
510
+ implement appropriate content safety safeguards based on their specific
511
+ product policies and application use cases.
512
+ - **Misuse for malicious purposes**: Technical limitations and developer
513
+ and end-user education can help mitigate malicious applications of
514
+ generative models. Educational resources and reporting mechanisms for users
515
+ to flag misuse are provided. Prohibited uses of Gemma models are outlined
516
+ in the
517
+ [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy).
518
+ - **Privacy violations**: Models were trained on data filtered for removal of
519
+ certain personal information and other sensitive data. Developers are
520
+ encouraged to adhere to privacy regulations with privacy-preserving
521
+ techniques.
522
+
523
+ ### Benefits
524
+
525
+ At the time of release, this family of models provides high-performance open
526
+ generative model implementations designed from the ground up for responsible AI
527
+ development compared to similarly sized models.
528
+
529
+ Using the benchmark evaluation metrics described in this document, these models
530
+ have been shown to provide superior performance to other, comparably sized open model
531
+ alternatives.
chat_template.jinja ADDED
@@ -0,0 +1,49 @@
1
+ {{ bos_token }}
2
+ {%- if messages[0]['role'] == 'system' -%}
3
+ {%- if messages[0]['content'] is string -%}
4
+ {%- set first_user_prefix = messages[0]['content'] + '
5
+
6
+ ' -%}
7
+ {%- else -%}
8
+ {%- set first_user_prefix = messages[0]['content'][0]['text'] + '
9
+
10
+ ' -%}
11
+ {%- endif -%}
12
+ {%- set loop_messages = messages[1:] -%}
13
+ {%- else -%}
14
+ {%- set first_user_prefix = "" -%}
15
+ {%- set loop_messages = messages -%}
16
+ {%- endif -%}
17
+ {%- for message in loop_messages -%}
18
+ {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}
19
+ {{ raise_exception("Conversation roles must alternate user/assistant/user/assistant/...") }}
20
+ {%- endif -%}
21
+ {%- if (message['role'] == 'assistant') -%}
22
+ {%- set role = "model" -%}
23
+ {%- else -%}
24
+ {%- set role = message['role'] -%}
25
+ {%- endif -%}
26
+ {{ '<start_of_turn>' + role + '
27
+ ' + (first_user_prefix if loop.first else "") }}
28
+ {%- if message['content'] is string -%}
29
+ {{ message['content'] | trim }}
30
+ {%- elif message['content'] is iterable -%}
31
+ {%- for item in message['content'] -%}
32
+ {%- if item['type'] == 'audio' -%}
33
+ {{ '<audio_soft_token>' }}
34
+ {%- elif item['type'] == 'image' -%}
35
+ {{ '<image_soft_token>' }}
36
+ {%- elif item['type'] == 'text' -%}
37
+ {{ item['text'] | trim }}
38
+ {%- endif -%}
39
+ {%- endfor -%}
40
+ {%- else -%}
41
+ {{ raise_exception("Invalid content type") }}
42
+ {%- endif -%}
43
+ {{ '<end_of_turn>
44
+ ' }}
45
+ {%- endfor -%}
46
+ {%- if add_generation_prompt -%}
47
+ {{'<start_of_turn>model
48
+ '}}
49
+ {%- endif -%}
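This template wraps each turn in `<start_of_turn>role … <end_of_turn>`, folds a system message into the first user turn, and replaces image and audio items with their soft tokens. For reference, the rendered format can be inspected by calling `apply_chat_template` with `tokenize=False`. A minimal sketch, using the base model id shown in the README (substitute the id you are actually loading):

```python
from transformers import AutoProcessor

# Model id taken from the README usage examples; substitute as needed.
processor = AutoProcessor.from_pretrained("google/gemma-3n-e4b-it")

messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {"role": "user", "content": [{"type": "text", "text": "Hello!"}]},
]

# tokenize=False returns the raw prompt string, so the turn structure defined
# in chat_template.jinja is visible.
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# Roughly:
# <bos><start_of_turn>user
# You are a helpful assistant.
#
# Hello!<end_of_turn>
# <start_of_turn>model
```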
config.json ADDED
@@ -0,0 +1,265 @@
1
+ {
2
+ "architectures": [
3
+ "Gemma3nForConditionalGeneration"
4
+ ],
5
+ "audio_config": {
6
+ "conf_attention_chunk_size": 12,
7
+ "conf_attention_context_left": 13,
8
+ "conf_attention_context_right": 0,
9
+ "conf_attention_logit_cap": 50.0,
10
+ "conf_conv_kernel_size": 5,
11
+ "conf_num_attention_heads": 8,
12
+ "conf_num_hidden_layers": 12,
13
+ "conf_reduction_factor": 4,
14
+ "conf_residual_weight": 0.5,
15
+ "gradient_clipping": 10000000000.0,
16
+ "hidden_size": 1536,
17
+ "input_feat_size": 128,
18
+ "model_type": "gemma3n_audio",
19
+ "rms_norm_eps": 1e-06,
20
+ "sscp_conv_channel_size": [
21
+ 128,
22
+ 32
23
+ ],
24
+ "sscp_conv_group_norm_eps": 0.001,
25
+ "sscp_conv_kernel_size": [
26
+ [
27
+ 3,
28
+ 3
29
+ ],
30
+ [
31
+ 3,
32
+ 3
33
+ ]
34
+ ],
35
+ "sscp_conv_stride_size": [
36
+ [
37
+ 2,
38
+ 2
39
+ ],
40
+ [
41
+ 2,
42
+ 2
43
+ ]
44
+ ],
45
+ "torch_dtype": "bfloat16",
46
+ "vocab_offset": 262272,
47
+ "vocab_size": 128
48
+ },
49
+ "audio_soft_tokens_per_image": 188,
50
+ "audio_token_id": 262273,
51
+ "boa_token_id": 256000,
52
+ "boi_token_id": 255999,
53
+ "eoa_token_id": 262272,
54
+ "eoi_token_id": 262144,
55
+ "eos_token_id": [
56
+ 1,
57
+ 106
58
+ ],
59
+ "image_token_id": 262145,
60
+ "initializer_range": 0.02,
61
+ "model_type": "gemma3n",
62
+ "quantization_config": {
63
+ "config_groups": {
64
+ "group_0": {
65
+ "input_activations": {
66
+ "actorder": null,
67
+ "block_structure": null,
68
+ "dynamic": true,
69
+ "group_size": null,
70
+ "num_bits": 8,
71
+ "observer": null,
72
+ "observer_kwargs": {},
73
+ "strategy": "token",
74
+ "symmetric": true,
75
+ "type": "float"
76
+ },
77
+ "output_activations": null,
78
+ "targets": [
79
+ "Linear"
80
+ ],
81
+ "weights": {
82
+ "actorder": null,
83
+ "block_structure": null,
84
+ "dynamic": false,
85
+ "group_size": null,
86
+ "num_bits": 8,
87
+ "observer": "minmax",
88
+ "observer_kwargs": {},
89
+ "strategy": "channel",
90
+ "symmetric": true,
91
+ "type": "float"
92
+ }
93
+ }
94
+ },
95
+ "format": "float-quantized",
96
+ "global_compression_ratio": null,
97
+ "ignore": [
98
+ "lm_head"
99
+ ],
100
+ "kv_cache_scheme": null,
101
+ "quant_method": "compressed-tensors",
102
+ "quantization_status": "compressed"
103
+ },
104
+ "text_config": {
105
+ "activation_sparsity_pattern": [
106
+ 0.95,
107
+ 0.95,
108
+ 0.95,
109
+ 0.95,
110
+ 0.95,
111
+ 0.95,
112
+ 0.95,
113
+ 0.95,
114
+ 0.95,
115
+ 0.95,
116
+ 0.0,
117
+ 0.0,
118
+ 0.0,
119
+ 0.0,
120
+ 0.0,
121
+ 0.0,
122
+ 0.0,
123
+ 0.0,
124
+ 0.0,
125
+ 0.0,
126
+ 0.0,
127
+ 0.0,
128
+ 0.0,
129
+ 0.0,
130
+ 0.0,
131
+ 0.0,
132
+ 0.0,
133
+ 0.0,
134
+ 0.0,
135
+ 0.0,
136
+ 0.0,
137
+ 0.0,
138
+ 0.0,
139
+ 0.0,
140
+ 0.0
141
+ ],
142
+ "altup_active_idx": 0,
143
+ "altup_coef_clip": 120.0,
144
+ "altup_correct_scale": true,
145
+ "altup_num_inputs": 4,
146
+ "attention_bias": false,
147
+ "attention_dropout": 0.0,
148
+ "final_logit_softcapping": 30.0,
149
+ "head_dim": 256,
150
+ "hidden_activation": "gelu_pytorch_tanh",
151
+ "hidden_size": 2048,
152
+ "hidden_size_per_layer_input": 256,
153
+ "initializer_range": 0.02,
154
+ "intermediate_size": [
155
+ 16384,
156
+ 16384,
157
+ 16384,
158
+ 16384,
159
+ 16384,
160
+ 16384,
161
+ 16384,
162
+ 16384,
163
+ 16384,
164
+ 16384,
165
+ 16384,
166
+ 16384,
167
+ 16384,
168
+ 16384,
169
+ 16384,
170
+ 16384,
171
+ 16384,
172
+ 16384,
173
+ 16384,
174
+ 16384,
175
+ 16384,
176
+ 16384,
177
+ 16384,
178
+ 16384,
179
+ 16384,
180
+ 16384,
181
+ 16384,
182
+ 16384,
183
+ 16384,
184
+ 16384,
185
+ 16384,
186
+ 16384,
187
+ 16384,
188
+ 16384,
189
+ 16384
190
+ ],
191
+ "laurel_rank": 64,
192
+ "layer_types": [
193
+ "sliding_attention",
194
+ "sliding_attention",
195
+ "sliding_attention",
196
+ "sliding_attention",
197
+ "full_attention",
198
+ "sliding_attention",
199
+ "sliding_attention",
200
+ "sliding_attention",
201
+ "sliding_attention",
202
+ "full_attention",
203
+ "sliding_attention",
204
+ "sliding_attention",
205
+ "sliding_attention",
206
+ "sliding_attention",
207
+ "full_attention",
208
+ "sliding_attention",
209
+ "sliding_attention",
210
+ "sliding_attention",
211
+ "sliding_attention",
212
+ "full_attention",
213
+ "sliding_attention",
214
+ "sliding_attention",
215
+ "sliding_attention",
216
+ "sliding_attention",
217
+ "full_attention",
218
+ "sliding_attention",
219
+ "sliding_attention",
220
+ "sliding_attention",
221
+ "sliding_attention",
222
+ "full_attention",
223
+ "sliding_attention",
224
+ "sliding_attention",
225
+ "sliding_attention",
226
+ "sliding_attention",
227
+ "full_attention"
228
+ ],
229
+ "max_position_embeddings": 32768,
230
+ "model_type": "gemma3n_text",
231
+ "num_attention_heads": 8,
232
+ "num_hidden_layers": 35,
233
+ "num_key_value_heads": 2,
234
+ "num_kv_shared_layers": 15,
235
+ "rms_norm_eps": 1e-06,
236
+ "rope_local_base_freq": 10000.0,
237
+ "rope_scaling": null,
238
+ "rope_theta": 1000000.0,
239
+ "sliding_window": 512,
240
+ "torch_dtype": "bfloat16",
241
+ "use_cache": true,
242
+ "vocab_size": 262400,
243
+ "vocab_size_per_layer_input": 262144
244
+ },
245
+ "torch_dtype": "bfloat16",
246
+ "transformers_version": "4.53.2",
247
+ "vision_config": {
248
+ "architecture": "mobilenetv5_300m_enc",
249
+ "do_pooling": false,
250
+ "hidden_size": 2048,
251
+ "initializer_range": 0.02,
252
+ "label_names": [
253
+ "LABEL_0",
254
+ "LABEL_1"
255
+ ],
256
+ "model_args": null,
257
+ "model_type": "gemma3n_vision",
258
+ "num_classes": 2,
259
+ "rms_norm_eps": 1e-06,
260
+ "torch_dtype": "bfloat16",
261
+ "vocab_offset": 262144,
262
+ "vocab_size": 128
263
+ },
264
+ "vision_soft_tokens_per_image": 256
265
+ }
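The `quantization_config` block above declares compressed-tensors FP8 quantization: static per-channel FP8 weights and dynamic per-token FP8 activations on every `Linear` module, with `lm_head` left unquantized. As a hedged loading sketch (assuming the `compressed-tensors` package is installed and your `transformers` version recognizes `quant_method: "compressed-tensors"`), the config is applied automatically at load time:

```python
# Sketch of loading this FP8 (compressed-tensors) checkpoint with transformers.
# Assumes: pip install compressed-tensors, plus a transformers version that
# supports Gemma 3n and compressed-tensors (4.53+ per this repo's config).
from transformers import AutoProcessor, Gemma3nForConditionalGeneration

model_id = "."  # run from a local clone of this repository, or substitute its Hub id

model = Gemma3nForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",  # the quantization_config in config.json is picked up automatically
)
processor = AutoProcessor.from_pretrained(model_id)
```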
generation_config.json ADDED
@@ -0,0 +1,13 @@
1
+ {
2
+ "bos_token_id": 2,
3
+ "cache_implementation": "hybrid",
4
+ "do_sample": true,
5
+ "eos_token_id": [
6
+ 1,
7
+ 106
8
+ ],
9
+ "pad_token_id": 0,
10
+ "top_k": 64,
11
+ "top_p": 0.95,
12
+ "transformers_version": "4.53.2"
13
+ }
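These values are the defaults that `model.generate()` falls back to when no explicit arguments are given: sampling enabled with top-k 64 and top-p 0.95, stopping on either listed eos token id, and a hybrid KV cache. A minimal sketch of overriding them per call, assuming `model` and `inputs` have been prepared as in the README examples; `temperature=0.7` is an illustrative value, not something stored in this file:

```python
# generation_config.json supplies the defaults; any of them can be overridden
# per call to generate().
output_ids = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,    # matches the default
    temperature=0.7,   # example override, not part of generation_config.json
    top_k=64,          # default from generation_config.json
    top_p=0.95,        # default from generation_config.json
)
```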
model-00001-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:132afc2c30d644d9601a444b240845d9a19625c757b404f8e7d883ed5da5c41c
3
+ size 4972481528
model-00002-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a632231b781ada972ee3a753af22ab7f7c6763417616b9400649c189b41d672a
3
+ size 631491424
model-00003-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9ccc1977ed5bee85bf776a112e355e0542de6d1e71401b05e38cb6ade82e2ee0
3
+ size 4998090568
model-00004-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9d30e3ecd98c180f74393ebe9049d979452e4fb6d0e0117cda647a94db579a64
3
+ size 433740000
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
recipe.yaml ADDED
@@ -0,0 +1,6 @@
1
+ default_stage:
2
+ default_modifiers:
3
+ QuantizationModifier:
4
+ targets: [Linear]
5
+ ignore: [lm_head]
6
+ scheme: FP8_DYNAMIC
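This recipe is in the format used by the `llmcompressor` library: every `Linear` layer except `lm_head` is quantized with the `FP8_DYNAMIC` scheme (per-channel FP8 weights, dynamic per-token FP8 activations), matching the `quantization_config` stored in `config.json`. Below is a rough reproduction sketch under those assumptions; import paths, argument names, and multimodal handling may differ between `llmcompressor` versions, and the output directory name is arbitrary.

```python
# Rough sketch of producing an FP8_DYNAMIC checkpoint like this one with
# llmcompressor. FP8_DYNAMIC is data-free, so no calibration set is needed.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoProcessor, Gemma3nForConditionalGeneration

model_id = "google/gemma-3n-E4B-it"  # base model listed in the README frontmatter
model = Gemma3nForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(model_id)

# Mirrors recipe.yaml: quantize Linear layers, skip lm_head, FP8_DYNAMIC scheme.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)

save_dir = "gemma-3n-E4B-it-FP8-Dynamic"  # arbitrary local output directory
model.save_pretrained(save_dir, save_compressed=True)
processor.save_pretrained(save_dir)
```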
special_tokens_map.json ADDED
@@ -0,0 +1,36 @@
1
+ {
2
+ "audio_token": "<audio_soft_token>",
3
+ "boa_token": "<start_of_audio>",
4
+ "boi_token": "<start_of_image>",
5
+ "bos_token": {
6
+ "content": "<bos>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false
11
+ },
12
+ "eoa_token": "<end_of_audio>",
13
+ "eoi_token": "<end_of_image>",
14
+ "eos_token": {
15
+ "content": "<eos>",
16
+ "lstrip": false,
17
+ "normalized": false,
18
+ "rstrip": false,
19
+ "single_word": false
20
+ },
21
+ "image_token": "<image_soft_token>",
22
+ "pad_token": {
23
+ "content": "<pad>",
24
+ "lstrip": false,
25
+ "normalized": false,
26
+ "rstrip": false,
27
+ "single_word": false
28
+ },
29
+ "unk_token": {
30
+ "content": "<unk>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false
35
+ }
36
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:57cce74cb255a2400a0aaf1be4bb84afcce027f7809d149a5cde1a7a93fa56cd
3
+ size 33442819
tokenizer.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ea5f0cc48abfbfc04d14562270a32e02149a3e7035f368cc5a462786f4a59961
3
+ size 4696020
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff