OSainz commited on
Commit
b580a31
·
verified ·
1 Parent(s): 3928af4

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +248 -3
README.md CHANGED
@@ -1,3 +1,248 @@
1
- ---
2
- license: llama3.1
3
- ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ library_name: transformers # Specify the library
3
+ datasets:
4
+ - HiTZ/latxa-corpus-v1.1
5
+ language:
6
+ - eu
7
+ - en
8
+ metrics:
9
+ - accuracy
10
+ pipeline_tag: text-generation
11
+ model-index:
12
+ - name: Latxa-Llama-3.1-70B-Instruct
13
+ results:
14
+ - task:
15
+ type: multiple-choice
16
+ dataset:
17
+ name: xstory_cloze
18
+ type: XStory
19
+ metrics:
20
+ - name: Accuracy (5-shot)
21
+ type: Accuracy (5-shot)
22
+ value: 77.83
23
+ - task:
24
+ type: multiple-choice
25
+ dataset:
26
+ name: belebele
27
+ type: Belebele
28
+ metrics:
29
+ - name: Accuracy (5-shot)
30
+ type: Accuracy (5-shot)
31
+ value: 91.00
32
+ - task:
33
+ type: multiple_choice
34
+ dataset:
35
+ name: eus_proficiency
36
+ type: EusProficiency
37
+ metrics:
38
+ - name: Accuracy (5-shot)
39
+ type: Accuracy (5-shot)
40
+ value: 68.00
41
+ - task:
42
+ type: multiple_choice
43
+ dataset:
44
+ name: eus_reading
45
+ type: EusReading
46
+ metrics:
47
+ - name: Accuracy (5-shot)
48
+ type: Accuracy (5-shot)
49
+ value: 78.98
50
+ - task:
51
+ type: multiple_choice
52
+ dataset:
53
+ name: eus_trivia
54
+ type: EusTrivia
55
+ metrics:
56
+ - name: Accuracy (5-shot)
57
+ type: Accuracy (5-shot)
58
+ value: 74.17
59
+ - task:
60
+ type: multiple_choice
61
+ dataset:
62
+ name: eus_exams
63
+ type: EusExams
64
+ metrics:
65
+ - name: Accuracy (5-shot)
66
+ type: Accuracy (5-shot)
67
+ value: 71.56
68
+
69
+ license: llama3.1
70
+ base_model:
71
+ - meta-llama/Llama-3.1-70B-Instruct
72
+
73
+ co2_eq_emissions:
74
+ emissions: 1900.8
75
+ source: "CodeCarbon"
76
+ training_type: "pre-training"
77
+ geographical_location: "EU-West"
78
+ hardware_used: "256xA100 GPUs"
79
+ ---
80
+
81
+ # Model Card for HiTZ/Latxa-Llama-3.1-70B-Instruct
82
+
83
+ <p align="center">
84
+ <img src="https://github.com/hitz-zentroa/latxa/blob/b9aa705f60ee2cc03c9ed62fda82a685abb31b07/assets/latxa_round.png?raw=true" style="height: 350px;">
85
+ </p>
86
+
87
+ We introduce Latxa 3.1 70B Instruct, an instructed version of [Latxa](https://aclanthology.org/2024.acl-long.799/). This new Latxa is based on Llama-3.1 (Instruct), which we trained on our Basque corpus (Etxaniz et al., 2024) comprising 4.3M documents and 4.2B tokens using language adaptation techniques (paper in preparation).
88
+ > [!WARNING]
89
+ > DISCLAIMER
90
+ >
91
+ > This model is still under development.
92
+ > Further training details will be released with the corresponding research paper in the near future.
93
+
94
+
95
+ Our preliminary experimentation shows that Latxa 3.1 70B Instruct outperforms Llama-3.1-Instruct by a large margin on Basque standard benchmarks, and particularly, on chat conversations. In addition, we organized a public arena-based evaluation, on which Latxat competed against other baselines and proprietary models such as GPT-4o and Claude Sonnet. The results showed that Latxa ranked 3rd, just behind Claude and GPT-4 and above all the other same-size competitors.
96
+ The official paper is coming soon.
97
+
98
+
99
+ ## Model Details
100
+
101
+ ### Model Description
102
+
103
+ Latxa is a family of Large Language Models (LLM) based on Meta’s LLaMA models. Current LLMs exhibit incredible performance
104
+ for high-resource languages such as English, but, in the case of Basque and other low-resource languages, their performance
105
+ is close to a random guesser. These limitations widen the gap between high- and low-resource languages when it comes to
106
+ digital development. We present Latxa to overcome these limitations and promote the development of LLM-based technology and
107
+ research for the Basque language. Latxa models follow the same architecture as their original counterparts and were further
108
+ trained in [Latxa Corpus v1.1](https://huggingface.co/datasets/HiTZ/latxa-corpus-v1.1), a high-quality Basque corpora.
109
+
110
+ - **Developed by:** HiTZ Research Center & IXA Research group (University of the Basque Country UPV/EHU)
111
+ - **Model type:** Language model
112
+ - **Language(s) (NLP):** eu
113
+ - **License:** llama3.1
114
+ - **Parent model:** meta-llama/Llama-3.1-70B-Instruct
115
+ - **Contact:** [email protected]
116
+
117
+
118
+ ### Getting Started
119
+ Use the code below to get started with the model.
120
+
121
+ ```python
122
+ from transformers import pipeline
123
+
124
+ pipe = pipeline('text-generation', model='HiTZ/Latxa-Llama-3.1-70B-Instruct')
125
+
126
+ messages = [
127
+ {'role': 'user', 'content': 'Kaixo!'},
128
+ ]
129
+
130
+ pipe(messages)
131
+
132
+ >>
133
+ [
134
+ {
135
+ 'generated_text': [
136
+ {'role': 'user', 'content': 'Kaixo!'},
137
+ {'role': 'assistant', 'content': 'Kaixo! Zer moduz? Zer behar edo galdetu nahi duzu?'}
138
+ ]
139
+ }
140
+ ]
141
+ ```
142
+
143
+ ## Uses
144
+
145
+ Latxa models are intended to be used with Basque data; for any other language the performance is not guaranteed.
146
+ Same as the original, Latxa inherits the [Llama-3.1 License](https://www.llama.com/llama3_1/license/) which allows for commercial and research use.
147
+
148
+ ### Direct Use
149
+
150
+ Latxa Instruct models are trained to follow instructions or to work as chat assistants.
151
+
152
+ ### Out-of-Scope Use
153
+
154
+ The model is not intended for malicious activities, such as harming others or violating human rights. Any downstream application must comply with current laws and regulations.
155
+ Irresponsible usage in production environments without proper risk assessment and mitigation is also discouraged.
156
+
157
+ ## Bias, Risks, and Limitations
158
+
159
+ In an effort to alleviate the potentially disturbing or harmful content, Latxa has been trained on carefully selected and processed
160
+ data which comes mainly from local media, national/regional newspapers, encyclopedias and blogs (see [Latxa Corpus v1.1](https://huggingface.co/datasets/HiTZ/latxa-corpus-v1.1)). Still, the
161
+ model is based on Llama 3.1 models and can potentially carry the same bias, risk and limitations.
162
+ Please see the Llama’s Ethical Considerations and Limitations for further information.
163
+
164
+
165
+ ## Training Details
166
+
167
+ > [!WARNING]
168
+ > DISCLAIMER
169
+ >
170
+ > Further training details will be released with the corresponding research paper in the near future.
171
+
172
+
173
+ ## Evaluation
174
+
175
+ We evaluated the models 5-shot settings on multiple-choice tasks. We used the basque partitions of each dataset.
176
+
177
+ The arena results will be released in the future.
178
+
179
+ ### Testing Data, Factors & Metrics
180
+
181
+ #### Testing Data
182
+
183
+ - **Belebele** (Bandarkar et al.): Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. We evaluated the model in a 5-shot fashion.
184
+ - Data card: https://huggingface.co/datasets/facebook/belebele
185
+ - **X-StoryCloze** (Lin et al.): XStoryCloze consists of the professionally translated version of the English StoryCloze dataset to 10 non-English languages. Story Cloze is a commonsense reasoning dataset which consists of choosing the correct ending to a four-sentence story. We evaluated the model in a 5-shot fashion.
186
+ - Data card: https://huggingface.co/datasets/juletxara/xstory_cloze
187
+ - **EusProficiency** (Etxaniz et al., 2024): EusProficiency comprises 5,169 exercises on different topics from past EGA exams, the official C1-level certificate of proficiency in Basque.
188
+ - Data card: https://huggingface.co/datasets/HiTZ/EusProficiency
189
+ - **EusReading** (Etxaniz et al., 2024): EusReading consists of 352 reading comprehension exercises (irakurmena) sourced from the same set of past EGA exams.
190
+ - Data card: https://huggingface.co/datasets/HiTZ/EusReading
191
+ - **EusTrivia** (Etxaniz et al., 2024): EusTrivia consists of 1,715 trivia questions from multiple online sources. 56.3% of the questions are elementary level (grades 3-6), while the rest are considered challenging.
192
+ - Data card: https://huggingface.co/datasets/HiTZ/EusTrivia
193
+ - **EusExams** (Etxaniz et al., 2024): EusExams is a collection of tests designed to prepare individuals for Public Service examinations conducted by several Basque institutions, including the public health system Osakidetza, the Basque Government, the City Councils of Bilbao and Gasteiz, and the University of the Basque Country (UPV/EHU).
194
+ - Data card: https://huggingface.co/datasets/HiTZ/EusExams
195
+
196
+ #### Metrics
197
+
198
+ We use Accuracy, as they are framed as Multiple Choice questions.
199
+
200
+ ### Results
201
+
202
+ | Task | Llama-3.1 8B Instruct | Latxa 3.1 8B Instruct | Llama-3.1 70B Instruct | Latxa 3.1 70B Instruct |
203
+ | :---- | :---: | :---: | :---: | :---: |
204
+ | Belebele | 73.89 | 80.00 | 89.11 | 91.00
205
+ | X-Story Cloze | 61.22 | 71.34 | 69.69 | 77.83 |
206
+ | EusProficiency | 34.13 | 52.83 | 43.59 | 68.00 |
207
+ | EusReading | 49.72 | 62.78 | 72.16 | 78.98 |
208
+ | EusTrivia | 45.01 | 61.05 | 62.51 | 74.17 |
209
+ | EusExams | 46.21 | 56.00 | 63.28 | 71.56 |
210
+
211
+
212
+ ## Environmental Impact
213
+
214
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
215
+
216
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
217
+
218
+ <!-- 62.50h x 256 GPU = 16.000h -->
219
+
220
+ - **Hardware Type:** HPC Cluster, 4 x A100 64Gb nodes x64
221
+ - **Hours used (total GPU hours):** 16000h
222
+ - **Cloud Provider:** CINECA HPC
223
+ - **Compute Region:** Italy
224
+ - **Carbon Emitted:** 1900.8kg CO2 eq
225
+
226
+ ## Acknowledgements
227
+
228
+ This work has been partially supported by the Basque Government (IKER-GAITU project).
229
+
230
+ It has also been partially supported by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project with reference 2022/TL22/00215335.
231
+
232
+ The models were trained on the Leonardo supercomputer at CINECA under the EuroHPC Joint Undertaking, project EHPC-EXT-2023E01-013.
233
+
234
+
235
+ ## Citation
236
+
237
+ Coming soon.
238
+
239
+ Meanwhile, you can reference:
240
+ ```bibtex
241
+ @misc{etxaniz2024latxa,
242
+ title={{L}atxa: An Open Language Model and Evaluation Suite for {B}asque},
243
+ author={Julen Etxaniz and Oscar Sainz and Naiara Perez and Itziar Aldabe and German Rigau and Eneko Agirre and Aitor Ormazabal and Mikel Artetxe and Aitor Soroa},
244
+ year={2024},
245
+ eprint={2403.20266},
246
+ archivePrefix={arXiv},
247
+ primaryClass={cs.CL}
248
+ }