TRAINING DATA

#23
by amanpreet7 - opened

I wanted to ask which, or to be precise, what datasets you used to train this LLM. I am thrilled to know.

Google org

Hi @amanpreet7 ,

Welcome to the Google Gemma family of open source models. Gemma models are trained on a large corpus of open source internet data such as books, novels, blogs, etc. Gemma 3 models are pre-trained on a slightly larger token budget than Gemma 2: we train on 14T tokens for Gemma 3 27B, 12T for the 12B version, 4T for the 4B, and 2T tokens for the 1B. The increase in tokens accounts for the mix of images and text used during pre-training. We also increase the amount of multilingual data to improve language coverage, adding both monolingual and parallel data, and we handle the imbalance in language representation using a strategy inspired by Chung et al. (2023).

To learn more about the Gemma models, please visit the following page.

Thanks.

What were the names of the datasets that you used?

Google org

Hi, the model is trained on a large amount of open source data such as blogs, novels, and other open resources. We don't have any information about the specific datasets used. For more technical details about the model, please see the following document.

Thanks.

Hello.
This is the best model compared to other models of similar size for my language, Vietnamese. Thank you very much, Google team.
For many reasons, I understand that it is impossible to describe the training dataset in detail, at least for now.
However, is there any way to know whether my data is in the training data or not?
I want to fine-tune the model, but without knowing this, I may spend a lot of effort and time collecting data that was already used in training.
Thanks a lot.
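
Since Google has not published a dataset list, there is no exact answer, but one rough, unofficial heuristic is a perplexity check: text the model memorized during pre-training often scores noticeably lower perplexity than comparable unseen text. Below is a minimal sketch assuming the `google/gemma-3-1b-it` checkpoint and the Hugging Face `transformers` library; the model ID and any threshold you pick are assumptions, and a low score is only a hint, never proof of membership.

```python
# Hedged sketch: perplexity-based membership heuristic, not an official test.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-1b-it"  # assumed checkpoint; any Gemma 3 variant works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

def perplexity(text: str) -> float:
    """Average per-token perplexity of `text` under the model."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels == input_ids makes the model return mean cross-entropy loss.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

# Compare your candidate data against text the model is unlikely to have seen
# (e.g., written after the training cutoff). A large gap is suggestive at best.
print(perplexity("Một đoạn văn tiếng Việt từ dữ liệu của bạn."))
```

Either way, fine-tuning on data the model has already seen is usually not wasted effort: it still adapts the model toward your task and format.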
