TRAINING DATA

#23
by amanpreet7 - opened

I wanted to ask which, or to be precise, what datasets you used to train this LLM. I am thrilled to know.

Google org

Hi @amanpreet7 ,

Welcome to the Google Gemma family of open source models. Gemma models are trained on a large corpus of open source internet data such as books, novels, blogs, etc. Gemma 3 models are pre-trained on a slightly larger token budget than Gemma 2: we train on 14T tokens for Gemma 3 27B, 12T for the 12B version, 4T for the 4B, and 2T tokens for the 1B. The increase in tokens accounts for the mix of images and text used during pre-training. We also increase the amount of multilingual data to improve language coverage, adding both monolingual and parallel data, and we handle the imbalance in language representation using a strategy inspired by Chung et al. (2023).

To learn more about the Gemma models, please visit the following page.

Thanks.

What were the names of the datasets that you used?

Google org

Hi, the model is trained on a large amount of open source data such as blogs, novels, and other open resources. We don't have any information about the specific datasets used. For more technical details about the model, please see the following document.

Thanks.

Hello.
This is the best model compared to other models of similar size for my language, Vietnamese. Thank you very much, Google team.
For many reasons, I understand that it is impossible to describe the training dataset in detail, at least for now.
However, is there any way to know whether my data is in the training data or not?
I want to fine-tune the model, but without knowing this, I may spend a lot of effort and time collecting data that was already used in training.
Thanks a lot.
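
Since Google has not published a dataset list, there is no exact answer, but one rough, unofficial heuristic is a perplexity check: text the model memorized during pre-training often scores noticeably lower perplexity than comparable unseen text. Below is a minimal sketch assuming the `google/gemma-3-1b-it` checkpoint and the Hugging Face `transformers` library; the model ID and any threshold you pick are assumptions, and a low score is only a hint, never proof of membership.

```python
# Hedged sketch: perplexity-based membership heuristic, not an official test.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-1b-it"  # assumed checkpoint; any Gemma 3 variant works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

def perplexity(text: str) -> float:
    """Average per-token perplexity of `text` under the model."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels == input_ids makes the model return mean cross-entropy loss.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

# Compare your candidate data against text the model is unlikely to have seen
# (e.g., written after the training cutoff). A large gap is suggestive at best.
print(perplexity("Một đoạn văn tiếng Việt từ dữ liệu của bạn."))
```

Either way, fine-tuning on data the model has already seen is usually not wasted effort: it still adapts the model toward your task and format.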
