Update README.md
README.md CHANGED
@@ -39,6 +39,28 @@ language:
 - sr
 - sv
 - uk
+datasets:
+- oscar-corpus/colossal-oscar-1.0
+- HuggingFaceFW/fineweb-edu
+- joelniklaus/eurlex_resources
+- joelito/legal-mc4
+- projecte-aina/CATalog
+- UFRGS/brwac
+- community-datasets/hrwac
+- danish-foundation-models/danish-gigaword
+- HiTZ/euscrawl
+- PleIAs/French-PD-Newspapers
+- PleIAs/French-PD-Books
+- AI-team-UoA/greek_legal_code
+- HiTZ/latxa-corpus-v1.1
+- allenai/peS2o
+- pile-of-law/pile-of-law
+- PORTULAN/parlamento-pt
+- hoskinson-center/proof-pile
+- togethercomputer/RedPajama-Data-1T
+- bigcode/starcoderdata
+- bjoernp/tagesschau-2018-2023
+- EleutherAI/the_pile_deduplicated
 base_model:
 - BSC-LT/salamandra-7b
 ---
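
The dataset IDs added to the `datasets:` metadata above are regular Hugging Face Hub references, so each one can be pulled with the `datasets` library. A minimal sketch for one of them, assuming the `sample-10BT` config and `text` column from the fineweb-edu dataset card (other entries use their own configs and splits):

```python
from datasets import load_dataset

# Stream a small sample of one of the newly listed corpora instead of
# downloading it in full; configs and columns vary per dataset.
fineweb_edu = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-10BT",
    split="train",
    streaming=True,
)

for example in fineweb_edu.take(3):
    print(example["text"][:200])
```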
@@ -198,13 +220,13 @@ The pre-training corpus comprises data from 35 European languages and 92 programming languages.
 The initial three training epochs used 2.4 trillion tokens, obtained by manually adjusting the data proportions to balance the representation
 and give more importance to Spain's co-official languages (Spanish, Catalan, Galician, and Basque). To this end, code and English data were downsampled to half,
 Spanish co-official languages were oversampled by 2x, and the remaining languages were kept in their original proportions.
+During the subsequent epochs, the English portion of the Colossal OSCAR dataset was replaced with the FineWeb-Edu dataset.
 This adjustment resulted in a total of 2.68 trillion tokens, distributed as outlined below:

+

 The pretraining corpus is predominantly composed of data from Colossal OSCAR, which contributes a significant 53.05% of the total tokens.
+Following this, Starcoder provides 13.67%, and FineWeb-Edu (350BT subset) adds 10.24%. The next largest sources are HPLT at 4.21% and French-PD at 3.59%.
 Other notable contributions include MaCoCu, Legal-ES, and EurLex, each contributing between 1.41% and 1.72%.
 These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model.
 The remaining 10% comes from smaller sources in various languages.
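
The sampling adjustment described in the hunk above (halve English and code, double Spain's co-official languages, keep everything else) is easy to express as a weighting step. A minimal sketch with made-up token counts, not the actual corpus statistics:

```python
# Illustrative reweighting of per-group token counts (numbers are placeholders).
raw_tokens = {                       # tokens available per group, in billions
    "english": 1100.0,
    "code": 600.0,
    "spanish_co_official": 180.0,    # Spanish, Catalan, Galician, Basque
    "other_languages": 520.0,
}

weights = {
    "english": 0.5,                  # downsampled to half
    "code": 0.5,                     # downsampled to half
    "spanish_co_official": 2.0,      # oversampled by 2x
    "other_languages": 1.0,          # kept in original proportion
}

epoch_tokens = {group: count * weights[group] for group, count in raw_tokens.items()}
total = sum(epoch_tokens.values())
for group, count in epoch_tokens.items():
    print(f"{group:>20}: {count:7.1f}B tokens ({100 * count / total:5.2f}% of the epoch)")
```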
 </details>

 The model was trained on 3 pre-training epochs with 2.4T tokens per epoch, 2 additional pre-training epochs in which the English part
+of the Colossal OSCAR dataset was replaced with FineWeb-Edu (350BT subset), resulting in 2.68T tokens per epoch;
 and 1 final epoch of 0.315T higher-quality tokens, meaning that the total number of tokens seen during pre-training is approximately 12.875 trillion tokens.

 We provide an extensive Datasheet section following the best practices defined by [(Gebru et al., 2021)](https://arxiv.org/pdf/1803.09010).
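
The 12.875T total in the hunk above follows directly from the per-epoch budgets it lists; a quick arithmetic check:

```python
# 3 epochs of the original mixture, 2 epochs with the FineWeb-Edu swap,
# and 1 final higher-quality epoch (token counts in trillions).
epochs = [(3, 2.4), (2, 2.68), (1, 0.315)]
total = sum(n * tokens for n, tokens in epochs)
print(f"{total:.3f}T tokens seen during pre-training")  # 12.875T
```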
@@ -379,7 +401,7 @@ and public institutions, which can be found in detail in the acknowledgements.

 **Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.**

+This work has been promoted and financed by the Government of Catalonia through the [Aina Project](https://projecteaina.cat/).

 This work is funded by the _Ministerio para la Transformación Digital y de la Función Pública_ - Funded by EU – NextGenerationEU
 within the framework of [ILENIA Project](https://proyectoilenia.es/) with reference 2022/TL22/00215337.
@@ -1152,4 +1174,4 @@ Technical report coming soon.
 |:---:|:---:|:---:|
 |2B| [Link](https://huggingface.co/BSC-LT/salamandra-2b) | [Link](https://huggingface.co/BSC-LT/salamandra-2b-instruct) |
 |7B| [Link](https://huggingface.co/BSC-LT/salamandra-7b) | [Link](https://huggingface.co/BSC-LT/salamandra-7b-instruct) |
-|40B| [Link](https://huggingface.co/BSC-LT/ALIA-40b) | WiP |
+|40B| [Link](https://huggingface.co/BSC-LT/ALIA-40b) | WiP |
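
All three checkpoints in the table resolve to standard `transformers` repositories, so they load the usual way. A minimal sketch, with illustrative dtype and device settings:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BSC-LT/salamandra-7b"  # or "BSC-LT/salamandra-2b", "BSC-LT/ALIA-40b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # pick a dtype your hardware supports
    device_map="auto",            # requires the `accelerate` package
)

inputs = tokenizer("El mercat del barri és", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=25)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```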