nvedant07 committed
Commit e7b4c95 · verified · 1 Parent(s): 6748bec

Update README.md

Files changed (1):
  1. README.md +81 -93
README.md CHANGED
@@ -3,16 +3,17 @@ language:
3
  - en
4
  - de
5
  license: other
6
- thumbnail: https://huggingface.co/Aleph-Alpha/Llama-TFree-HAT-Pretrained-8B-DPO/raw/main/source/aleph_alpha_logo_thumbnail.png
7
  license_name: open-aleph-license
8
  license_link: LICENSE
 
9
  tags:
10
  - Aleph Alpha Research
11
  - pytorch
12
  - Hierarchical Autoregressive Transformer
13
  - HAT
14
  model-index:
15
- - name: TFree-HAT-Pretrained-8B-Base
16
  results: []
17
  ---
18
 
@@ -33,29 +34,31 @@ model-index:
33
  <a href="https://twitter.com/Aleph__Alpha" target="_blank" style="margin: 2px;">
34
  <img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-AlephAlpha_Research-white?logo=x&amp;logoColor=white" style="display: inline-block; vertical-align: middle;"/>
35
  </a>
36
- <a href="https://huggingface.co/Aleph-Alpha/TFree-HAT-Pretrained-8B-Base/blob/main/LICENSE" style="margin: 2px;">
37
  <img alt="License" src="https://img.shields.io/badge/License-Open Aleph License-white?&amp;color=white" style="display: inline-block; vertical-align: middle;"/>
38
  </a>
39
  </div>
40
 
41
  <hr>
42
 
43
- # TFree-HAT-Pretrained-8B-Base
44
  <!-- markdownlint-disable first-line-h1 -->
45
  <!-- markdownlint-disable html -->
46
  <!-- markdownlint-disable no-duplicate-header -->
47
 
48
- This model card provides an overview of our **TFree-HAT-Pretrained-8B-Base** model, which is a foundation model developed by Aleph Alpha Research* and publicly available under the Open Aleph License, a license explicitly allowing for non-commercial research and educational use.
49
 
50
  The model is based on our Hierarchical Autoregressive Transformer (HAT) architecture, originally described in our [paper](https://arxiv.org/abs/2501.10322). This novel architecture integrates character-level encoding and decoding with a word-level backbone, allowing for improved text compression (fewer sequence positions) and better performance in the languages it has been trained on, potentially higher robustness to prompt changes, and improved adaptability to new languages & domains via fine-tuning.
51
 
52
- The model was pre-trained in English & German on carefully curated data in compliance with applicable EU and national regulations, including copyright and data privacy laws. It shows strong proficiency in German, while also beating Llama 3.1 on many benchmarks in English.
53
 
54
  You can find model weights and their corresponding safetensors conversions at the following links:
55
 
56
  | Model Name | Description |
57
  | --- | --- |
58
- | `TFree-HAT-Pretrained-8B-Base` | [Link](https://huggingface.co/Aleph-Alpha/TFree-HAT-Pretrained-8B-Base) - pre-trained for English and German, adapted to a maximum context length of 32900 words |
59
 
60
  # Model Access
61
 
@@ -84,13 +87,13 @@ Download model weights and run inference using the following example:
84
  import torch
85
  from transformers import AutoModelForCausalLM
86
  INPUT = "When was Rome founded?"
87
- MODEL_ID = "Aleph-Alpha/TFree-HAT-Pretrained-8B-Base"
88
  model = AutoModelForCausalLM.from_pretrained(
89
  trust_remote_code=True,
90
  pretrained_model_name_or_path=MODEL_ID,
91
  attn_implementation="flash_attention_2",
92
  ).to("cuda", torch.bfloat16)
93
- input_ids, cumulative_word_lengths = model._prepare_input(INPUT)
94
  model_output = model.generate(
95
  input_ids,
96
  cumulative_seq_lengths_per_word=cumulative_word_lengths,
@@ -101,12 +104,16 @@ print("Prompt: ", INPUT)
101
  print("Completion: ", model_output.completion_text)
102
  ```
103
 
104
- Please note that the realized inference speed strongly depends on the maturity of the inference implementation beyond the intrinsic text compression of any model. Besides this huggingface transformers-based inference solution, we are also releasing a [vLLM-based inference solution](https://github.com/Aleph-Alpha/vllm) for our models that is optimized for batched inference. Please note that this vLLM inference for HAT is still under active development.
 
 
 
105
 
 
106
 
107
  # Evaluation
108
 
109
- **Performance**: Our T-Free models deliver performance on par with strong tokenizer-based models such as [Llama 3.1 8B Base](https://huggingface.co/meta-llama/Llama-3.1-8B). Respective benchmarks and results can be found in the tables below.
110
 
111
  **Efficiency**: Our tokenizer-free approach results in improved text compression, providing a foundation for more efficient inference. We measure compression in terms of words processed across all languages and domains and define the metric as **tokenizer fertility**, or **bytes per sequence position**, where a higher value indicates better compression. Latency and throughput are currently out of scope for these research-centric evaluations and will be addressed in the future. Our evaluation framework automatically measures **bytes per sequence position** across datasets, allowing us to derive text compression scores and analyze variations across different dataset distributions. The resulting end-to-end efficiency depends on the inference implementation and is beyond the scope of the inference code provided here and the reported compression scores.
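
As a rough illustration of how such a compression score is obtained, the sketch below computes bytes per sequence position for a toy example. It is a simplification (whitespace splitting stands in for the HAT word splitter) and not the evaluation framework itself:

```python
# Minimal sketch: "bytes per sequence position" = UTF-8 bytes of a text divided by the
# number of sequence positions needed to encode it (word positions for HAT, tokens for
# a tokenizer-based model). Higher values indicate better compression.
def bytes_per_sequence_position(text: str, num_positions: int) -> float:
    return len(text.encode("utf-8")) / num_positions

text = "Aleph Alpha Research develops tokenizer-free language models."
# Whitespace splitting is only a stand-in for the actual HAT word splitter.
print(bytes_per_sequence_position(text, num_positions=len(text.split())))
```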
112
 
@@ -127,25 +134,46 @@ Please note that the realized inference speed strongly depends on the maturity o
127
  `CI`: Concordance Index<br>
128
  `ES`: Exponential Similarity
129
 
130
- ## Pre-training Benchmarks
131
-
132
- | Group | Task | Metric Name | Num Fewshot | [TFree-HAT-Pretrained-8B-Base](https://huggingface.co/Aleph-Alpha/TFree-HAT-Pretrained-8B-Base) | [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) | [TFree-HAT-Pretrained-8B-Base](https://huggingface.co/Aleph-Alpha/TFree-HAT-Pretrained-8B-Base) Compression | [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) Compression |
133
- | --- | --- | --- | --- | --- | --- | --- | --- |
134
- | Knowledge | MMLU | `norm_log_acc` | 5 | 0.664 | **0.668** | **5.184** | 4.278 |
135
- | Knowledge | MMLU Pro | `norm_log_acc` | 5 | **0.386** | 0.367 | **4.734** | 3.731 |
136
- | Knowledge | OpenBookQA | `norm_log_acc` | 10 | 0.360 | **0.366** | **4.982** | 4.724 |
137
- | Knowledge | TriviaQA | `comp_acc` | 10 | 0.658 | **0.695** | **5.317** | 4.221 |
138
- | Knowledge | TruthfulQA | `norm_prob_mass` | 6 | **0.306** | 0.279 | **4.945** | 4.197 |
139
- | Reasoning | ARC Challenge | `norm_log_acc` | 25 | **0.587** | 0.538 | **5.514** | 4.924 |
140
- | Reasoning | Winogrande | `norm_log_acc` | 5 | **0.754** | 0.747 | **5.158** | 4.909 |
141
- | German | MMMLU | `norm_log_acc` | 5 | **0.618** | 0.576 | **6.056** | 3.410 |
142
- | German | WMT16 | `bleu` | 5 | 34.405 | **34.998** | **5.968** | 4.210 |
143
- | German | WMT20 | `bleu` | 5 | **33.240** | 32.892 | **6.269** | 4.222 |
144
- | Math | GSM8K | `comp_acc` | 8 | **0.528** | **0.528** | **3.840** | 3.332 |
145
- | Long context | GSM8K | `comp_acc` | 16 | 0.536 | --- | 3.837 | --- |
146
- | Long context | Long Bench v2 | `norm_log_acc` | 10 | 0.336 | --- | 5.125 | --- |
147
- | Long context German | Long Bench v2 | `norm_log_acc` | 10 | 0.233 | --- | 5.872 | --- |
148
- | Safety | Winogender | `norm_log_acc` | 5 | **0.671** | 0.636 | **5.232** | 4.799 |
149
 
150
  # Training Details
151
 
@@ -157,7 +185,7 @@ The encoder processes input text as a sequence of UTF-8 bytes and produces a seq
157
 
158
  ## Encoder module
159
 
160
- | | **8B** |
161
  | --- | --- |
162
  | Number of layers | 6 |
163
  | Number of attention heads | 8 |
@@ -167,14 +195,14 @@ The encoder processes input text as a sequence of UTF-8 bytes and produces a seq
167
  | Cross-attention hidden size | 4096 |
168
  | MLP expansion factor | 2.75 |
169
  | MLP type | SwiGLU |
170
- | Sequence length | 262144 |
171
  | Position embeddings | RoPE with base 1e5 |
172
  | Attention type | causal, local with window size 768 |
173
  | QK-norm | disabled |
174
 
175
  ## Backbone module
176
 
177
- | | **8B** |
178
  | --- | --- |
179
  | Number of layers | 32 |
180
  | Number of attention heads | 32 |
@@ -183,14 +211,14 @@ The encoder processes input text as a sequence of UTF-8 bytes and produces a seq
183
  | Hidden size | 4096 |
184
  | MLP expansion factor | 3.5 |
185
  | MLP type | SwiGLU |
186
- | Sequence length | 32900 |
187
  | Position embeddings | RoPE with base 5e5 |
188
  | Attention type | causal |
189
  | QK-norm | per head |
190
 
191
  ## Decoder module
192
 
193
- | | **8B** |
194
  | --- | --- |
195
  | Number of layers | 4 |
196
  | Number of attention heads | 8 |
@@ -200,7 +228,7 @@ The encoder processes input text as a sequence of UTF-8 bytes and produces a seq
200
  | Cross-attention hidden size | 4096 |
201
  | MLP expansion factor | 2.75 |
202
  | MLP type | SwiGLU |
203
- | Sequence length | 262144 |
204
  | Position embeddings | RoPE with base 1e5 |
205
  | Attention type | causal, local with window size 768 |
206
  | QK-norm | disabled |
@@ -222,64 +250,19 @@ We also merged leading whitespace and trailing punctuation into the words to red
222
 
223
  To improve the processing of code and math documents, we made additional adjustments to the Unicode splitter. First, we split instances of camel cases like FooBar into Foo and Bar. Second, we treated math symbols (again by Unicode standard) as separate words.
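
These splitting rules can be sketched with the third-party `regex` package, which exposes Unicode character classes. The snippet below is an illustrative simplification (it omits the whitespace- and punctuation-merging step described earlier) and is not the splitter actually used in training:

```python
import regex  # third-party package with Unicode property classes (\p{...})

def split_words(text: str) -> list[str]:
    # Split camel case: "FooBar" -> "Foo Bar"
    text = regex.sub(r"(?<=\p{Ll})(?=\p{Lu})", " ", text)
    # Treat math symbols (Unicode category Sm) as separate words; otherwise split on whitespace.
    return regex.findall(r"\p{Sm}|[^\s\p{Sm}]+", text)

print(split_words("FooBar equals 1 + 2"))  # ['Foo', 'Bar', 'equals', '1', '+', '2']
```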
224
 
225
- ## Pre-Training
226
-
227
- **Approach**
228
-
229
- We randomly initialized all model parameters. The model was then trained on the next-byte-prediction objective on a large and diverse document corpus (see below). Initially, we trained on sequences of up to 3500 words for a total of nearly 4T words. We used a global batch size of 1024 (2.5M words) and followed a warmup-stable-decay schedule with a warmup of 5000 steps, a stable learning rate of 2e-3 for 945000 steps, and an inverse-square-root cooldown to a learning rate of 0 over the last 50000 steps. We employed weight decay of 0.05 for all parameters except the embedding and normalization parameters. We employed QK-norm per head and attention logit softcapping at 100, which we found to be important for training stability during pretraining.
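
For illustration, here is a minimal sketch of such a warmup-stable-decay schedule, using the step counts and peak learning rate quoted above; the exact parameterization of the inverse-square-root cooldown is an assumption, not the training code that was used:

```python
def wsd_learning_rate(step: int,
                      peak_lr: float = 2e-3,
                      warmup_steps: int = 5_000,
                      stable_steps: int = 945_000,
                      decay_steps: int = 50_000) -> float:
    """Warmup-stable-decay: linear warmup, constant plateau, then an
    inverse-square-root-shaped cooldown that reaches 0 at the final step."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps        # linear warmup
    if step < warmup_steps + stable_steps:
        return peak_lr                              # stable phase
    t = min((step - warmup_steps - stable_steps) / decay_steps, 1.0)  # 0 -> 1 over the cooldown
    inv_sqrt = (1.0 + t) ** -0.5                                      # 1 -> 1/sqrt(2)
    return peak_lr * (inv_sqrt - 2 ** -0.5) / (1.0 - 2 ** -0.5)       # rescaled to end at 0
```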
230
-
231
- We then continued training on sequences of up to 32900 words for another 2500 steps with a global batch size of 128, totaling 10.5B words, upweighting longer documents to make use of the extended context. We used a warmup-stable-decay learning-rate schedule with 500 steps of warmup, a stable learning rate of 2e-4, and a final decay to 0 over the last 500 steps. We disabled attention logit softcapping during this long-context adaptation so that it is not required during inference.
232
-
233
- The training was conducted in our [Scaling framework](https://github.com/Aleph-Alpha/scaling).
234
-
235
- **Data sources**
236
-
237
- The model was trained on a filtered subset of diverse corpora of text data including proprietary curated datasets, high-quality web content, public domain sources, German texts, mathematical texts, and programming code. The proportions and sources of data we used in the pre-training were:
238
-
239
- English Language Data (70%)
240
-
241
- - curated web and synthetic data (63%)
242
-
243
- - high quality curated sources such as Wikipedia and public domain books (7%)
244
 
245
- German Language Data (7%)
246
 
247
- - curated web and synthetic data (6.3%)
248
 
249
- - high quality curated sources such as Wikipedia and public domain books (0.7%)
250
 
251
- Mathematical Content (5%)
252
 
253
- - mathematical code and proofs (2%)
254
-
255
- - mathematical word problems and equations (3%)
256
-
257
- Programming Code (18%)
258
-
259
- - general programming code (11%)
260
-
261
- - high-quality and synthetic Python code (7%)
262
-
263
- ## Data curation
264
-
265
- We applied a range of curation techniques, e.g., for German as described in [Aleph-Alpha-GermanWeb](https://huggingface.co/datasets/Aleph-Alpha/Aleph-Alpha-GermanWeb). These include but are not limited to:
266
-
267
- - URL filtering. We used a URL filter developed to filter out fraudulent, harmful, and illegal content from an explicit blocklist, e.g., adult websites, or URLs containing words associated with fraudulent, harmful, or adult content.
268
-
269
- - Text extraction. Natural language text embedded in HTML and other web markup was extracted using the [Resiliparse](https://github.com/chatnoir-eu/chatnoir-resiliparse) text extractor.
270
-
271
- - Language identification. We used a [fastText language classifier](https://fasttext.cc/docs/en/language-identification.html) trained on character n-grams from Wikipedia to identify, retain, and sort texts into English and German.
272
-
273
- - Repetition removal. We applied heuristic methods for detection and removal of repetitions on the line, paragraph, and character level.
274
-
275
- - Document- and line-level filtering. We utilized additional document-level heuristics to ensure documents had a reasonable number and quality of words, exhibited naturalistic symbols-to-words and numbers-to-words ratios, were not predominantly made up of bullet points, and contained a sufficient quantity of real words.
276
-
277
- - Deduplication. We used exact and fuzzy deduplication to remove duplicate documents.
278
-
279
- ## Synthetic data
280
-
281
- We also generated synthetic data by using permissively-licensed LLMs.
282
 
 
283
 
284
  ## Legal Compliance
285
 
@@ -289,13 +272,12 @@ We acknowledge and abide by applicable national and international regulations, i
289
 
290
  ## Compute & Training Efficiency
291
 
292
- The following table shows the compute resources used in the training stages for the 8B models.
293
 
294
  | **Model** | **Training phase** | **GPUs** | **Approximate average power consumption per GPU** | **Approximate GPU hours** |
295
  | --- | --- | --- | --- | --- |
296
- | 8B | Pre-training (part 1) | 256 x H200 | 460W | 111,822 |
297
- | 8B | Pre-training (part 2) | 256 x H100 | 460W | 151,289 |
298
- | 8B | Long context adaptation | 256 x H100 | 190W | 5,328 |
299
 
300
  ## Environmental Impact
301
 
@@ -442,7 +424,13 @@ Some inference parameters, e.g., temperature, lead to the random sampling of out
442
 
443
  This list of risks, biases, and limitations may not be complete, as improving the understanding and behavior of language models is an ongoing research topic in the AI science community.
444
 
 
 
 
 
 
 
445
 
446
  \*Aleph Alpha Research refers to Aleph Alpha Research GmbH
447
 
448
- [hat-paper]: https://arxiv.org/abs/2501.10322
 
3
  - en
4
  - de
5
  license: other
6
+ thumbnail: https://huggingface.co/Aleph-Alpha/Llama-TFree-HAT-Pretrained-7B-DPO/raw/main/source/aleph_alpha_logo_thumbnail.png
7
  license_name: open-aleph-license
8
  license_link: LICENSE
9
+ base_model: Aleph-Alpha/TFree-HAT-Pretrained-7B-Base
10
  tags:
11
  - Aleph Alpha Research
12
  - pytorch
13
  - Hierarchical Autoregressive Transformer
14
  - HAT
15
  model-index:
16
+ - name: Llama-TFree-HAT-Pretrained-7B-DPO
17
  results: []
18
  ---
19
 
 
34
  <a href="https://twitter.com/Aleph__Alpha" target="_blank" style="margin: 2px;">
35
  <img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-AlephAlpha_Research-white?logo=x&amp;logoColor=white" style="display: inline-block; vertical-align: middle;"/>
36
  </a>
37
+ <a href="https://huggingface.co/Aleph-Alpha/Llama-TFree-HAT-Pretrained-7B-DPO/blob/main/LICENSE" style="margin: 2px;">
38
  <img alt="License" src="https://img.shields.io/badge/License-Open Aleph License-white?&amp;color=white" style="display: inline-block; vertical-align: middle;"/>
39
  </a>
40
  </div>
41
 
42
  <hr>
43
 
44
+ # Llama-TFree-HAT-Pretrained-7B-DPO
45
  <!-- markdownlint-disable first-line-h1 -->
46
  <!-- markdownlint-disable html -->
47
  <!-- markdownlint-disable no-duplicate-header -->
48
 
49
+ **NOTE: This model was pretrained from scratch and fine-tuned making use of Llama 3.3 for data filtering. In adherence with the Llama license, the model name therefore carries the Llama prefix.**
50
+
51
+ This model card provides an overview of our **Llama-TFree-HAT-Pretrained-7B-DPO** model, which is a tokenizer-free (TFree) foundation model developed by Aleph Alpha Research* and publicly available under the Open Aleph License, a license explicitly allowing for non-commercial research and educational use.
52
 
53
  The model is based on our Hierarchical Autoregressive Transformer (HAT) architecture, originally described in our [paper](https://arxiv.org/abs/2501.10322). This novel architecture integrates character-level encoding and decoding with a word-level backbone, allowing for improved text compression (fewer sequence positions) and better performance in the languages it has been trained on, potentially higher robustness to prompt changes, and improved adaptability to new languages & domains via fine-tuning.
54
 
55
+ The model was initialized from [`TFree-HAT-Pretrained-7B-Base`](https://huggingface.co/Aleph-Alpha/TFree-HAT-Pretrained-7B-Base) and post-trained and direct-preference-optimized in English & German on carefully curated data in compliance with applicable EU and national regulations, including copyright and data privacy laws. It shows strong proficiency in German, while also beating Llama 3.1 on many benchmarks in English. The direct-preference-optimization of [Llama-TFree-HAT-Pretrained-7B-DPO](https://huggingface.co/Aleph-Alpha/Llama-TFree-HAT-Pretrained-7B-DPO) prioritizes helpfulness and instruction following, making the model suitable for sensitive applications without the risk of over-refusal. The model has not been optimized for code generation and math and is thus not evaluated extensively on the respective benchmarks.
56
 
57
  You can find model weights and their corresponding safetensors conversions at the following links:
58
 
59
  | Model Name | Description |
60
  | --- | --- |
61
+ | `Llama-TFree-HAT-Pretrained-7B-DPO` | [Link](https://huggingface.co/Aleph-Alpha/Llama-TFree-HAT-Pretrained-7B-DPO) - a supervised fine-tuned and direct-preference-optimized version of `TFree-HAT-Pretrained-7B-Base` |
62
 
63
  # Model Access
64
 
 
87
  import torch
88
  from transformers import AutoModelForCausalLM
89
  INPUT = "When was Rome founded?"
90
+ MODEL_ID = "Aleph-Alpha/Llama-TFree-HAT-Pretrained-7B-DPO"
91
  model = AutoModelForCausalLM.from_pretrained(
92
  trust_remote_code=True,
93
  pretrained_model_name_or_path=MODEL_ID,
94
  attn_implementation="flash_attention_2",
95
  ).to("cuda", torch.bfloat16)
96
+ input_ids, cumulative_word_lengths = model._prepare_input(INPUT, add_llama_template=True)
97
  model_output = model.generate(
98
  input_ids,
99
  cumulative_seq_lengths_per_word=cumulative_word_lengths,
 
104
  print("Completion: ", model_output.completion_text)
105
  ```
106
 
107
+ Please note that the realized inference speed strongly depends on the maturity of the inference implementation beyond the intrinsic text compression of any model. Besides this Hugging Face transformers-based inference solution, we are also releasing a [vLLM-based inference solution](https://github.com/Aleph-Alpha/vllm) for our models that is optimized for batched inference. Please note that this vLLM inference for HAT is still under active development.
108
+
109
+
110
+ ## Prompt formatting
111
 
112
+ The prompt format used for our post-trained model is identical to the [Llama prompt format](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/). We highly recommend using it when prompting the model to ensure optimal performance of the direct-preference-optimized model versions. You can apply the recommended format by setting `add_llama_template=True` in the `model._prepare_input` method, as shown below.
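
A minimal usage sketch, assuming `model` is the `Llama-TFree-HAT-Pretrained-7B-DPO` instance loaded in the quick-start example above:

```python
# Apply the Llama chat template during input preparation and generate a completion.
prompt = "What is the longest river in Germany?"
input_ids, cumulative_word_lengths = model._prepare_input(prompt, add_llama_template=True)
model_output = model.generate(
    input_ids,
    cumulative_seq_lengths_per_word=cumulative_word_lengths,
)
print(model_output.completion_text)
```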
113
 
114
  # Evaluation
115
 
116
+ **Performance**: Our T-Free models deliver performance on par with current state-of-the-art open-source models of equivalent memory footprint in both English and German. For evaluation purposes, we compare our DPO model with [Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and [Tulu 3.1 8B](https://huggingface.co/allenai/Llama-3.1-Tulu-3.1-8B). The respective benchmarks and results can be found in the tables below.
117
 
118
  **Efficiency**: Our tokenizer-free approach results in improved text compression, providing a foundation for more efficient inference. We measure compression in terms of words processed across all languages and domains and define the metric as **tokenizer fertility**, or **bytes per sequence position**, where a higher value indicates better compression. Latency and throughput are currently out of scope for these research-centric evaluations and will be addressed in the future. Our evaluation framework automatically measures **bytes per sequence position** across datasets, allowing us to derive text compression scores and analyze variations across different dataset distributions. The resulting end-to-end efficiency depends on the inference implementation and is beyond the scope of the inference code provided here and the reported compression scores.
119
 
 
134
  `CI`: Concordance Index<br>
135
  `ES`: Exponential Similarity
136
 
137
+
138
+ ## DPO (Post-Training) Benchmarks
139
+
140
+ **MTBench winrates**
141
+
142
+ English/German MTBench numbers are based on datasets created with [FastChat](https://github.com/LumiOpen/FastChat) for the corresponding models.
143
+
144
+ | | **vs.** [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) **(Eng)** | **vs.** [Llama-3.1-Tulu-3-8B-DPO](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-DPO) **(Eng)** | **vs.** [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) **(Ger)** | **vs.** [Llama-3.1-Tulu-3-8B-DPO](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-DPO) **(Ger)** |
145
+ | --- | --- | --- | --- | --- |
146
+ | Llama-TFree-HAT-Pretrained-7B-DPO | 0.687 | 0.677 | 0.750 | 0.658 |
147
+
148
+ | Group | Task | Metric Name | Num Fewshot | [Llama-TFree-HAT-Pretrained-7B-DPO](https://huggingface.co/Aleph-Alpha/Llama-TFree-HAT-Pretrained-7B-DPO) | [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | [Llama-3.1-Tulu-3.1-8B](https://huggingface.co/allenai/Llama-3.1-Tulu-3.1-8B) | [Llama-TFree-HAT-Pretrained-7B-DPO](https://huggingface.co/Aleph-Alpha/Llama-TFree-HAT-Pretrained-7B-DPO) Compression | [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) Compression | [Llama-3.1-Tulu-3.1-8B](https://huggingface.co/allenai/Llama-3.1-Tulu-3.1-8B) Compression |
149
+ | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
150
+ | Knowledge | MMLU | `norm_log_acc` | 5 | 0.654 | **0.681** | 0.664 | **5.818** | 4.885 | 4.153 |
151
+ | Knowledge | Full Text MMLU | `norm_log_acc` | 5 | 0.658 | **0.680** | 0.677 | **5.849** | 5.075 | 4.408 |
152
+ | Knowledge | MMLU Pro | `norm_log_acc` | 5 | 0.376 | **0.402** | 0.322 | **5.135** | 4.077 | 4.077 |
153
+ | Knowledge | GPQA | `log_acc` | 0 | 0.299 | **0.306** | 0.271 | **5.260** | 3.771 | 3.408 |
154
+ | Knowledge | BBH | `norm_log_acc` | 3 | 0.490 | **0.522** | 0.494 | **5.332** | 4.374 | 3.668 |
155
+ | Knowledge | OpenBookQA | `norm_log_acc` | 10 | 0.418 | 0.526 | **0.528** | **7.101** | 6.973 | 4.041 |
156
+ | Knowledge | TruthfulQA | `norm_prob_mass` | 6 | **0.429** | 0.171 | 0.173 | **6.607** | 5.553 | 3.807 |
157
+ | Reasoning | ARC Easy | `norm_log_acc` | 25 | **0.892** | 0.875 | 0.873 | **7.018** | 6.396 | 4.497 |
158
+ | Reasoning | ARC Challenge | `norm_log_acc` | 25 | **0.655** | 0.638 | 0.650 | **6.860** | 6.218 | 4.522 |
159
+ | Reasoning | Winogrande | `norm_log_acc` | 5 | **0.713** | 0.657 | 0.683 | **6.856** | 6.517 | 4.116 |
160
+ | Reasoning | HellaSwag | `norm_log_acc` | 10 | 0.608 | 0.776 | **0.807** | **5.980** | 5.274 | 4.427 |
161
+ | German | MMMLU | `norm_log_acc` | 5 | **0.610** | 0.590 | 0.572 | **6.630** | 3.912 | 3.383 |
162
+ | German | [ARC Easy DE](https://huggingface.co/datasets/openGPT-X/arcx) | `norm_log_acc` | 25 | **0.823** | 0.729 | 0.751 | **7.872** | 4.910 | 3.607 |
163
+ | German | [ARC Challenge DE](https://huggingface.co/datasets/openGPT-X/arcx) | `norm_log_acc` | 25 | **0.599** | 0.503 | 0.525 | **7.798** | 4.862 | 3.610 |
164
+ | German | [Winogrande DE](https://huggingface.co/datasets/demelin/wino_x) | `norm_log_acc` | 5 | **0.799** | 0.729 | 0.711 | **7.225** | 5.310 | 3.391 |
165
+ | German | [HellaSwag DE](https://huggingface.co/datasets/openGPT-X/hellaswagx) | `norm_log_acc` | 10 | 0.535 | 0.626 | **0.657** | **6.971** | 4.137 | 3.603 |
166
+ | German | [TruthfulQA DE](https://huggingface.co/datasets/openGPT-X/truthfulqax) | `norm_prob_mass` | 6 | **0.420** | 0.168 | 0.171 | **7.394** | 4.581 | 3.276 |
167
+ | German | [GSM8K DE](https://huggingface.co/datasets/openGPT-X/gsm8kx) | `comp_acc` | 8 | 0.574 | 0.201 | **0.724** | **4.84** | 3.320 | 2.963 |
168
+ | German | WMT16 | `bleu` | 3 | 31.205 | 34.224 | 32.912 | **6.811** | 5.061 | 4.000 |
169
+ | German | WMT16 Instruct | `bleu` | 3 | 31.408 | **34.260** | 33.089 | **6.863** | 5.130 | 4.063 |
170
+ | Math | GSM8K | `comp_acc` | 8 | 0.711 | 0.757 | **0.870** | **4.324** | 3.794 | 3.356 |
171
+ | Long context | QuALITY | `log_acc` | 0 | 0.376 | 0.412 | **0.425** | **4.867** | 4.290 | 4.274 |
172
+ | Long context | ZeroSCROLLS MuSiQue | `F1` | 0 | 0.238 | 0.200 | 0.145 | **5.636** | 4.427 | 4.387 |
173
+ | Long context | ZeroSCROLLS Qasper | `F1` | 0 | 0.228 | **0.235** | 0.221 | **5.934** | 4.826 | 4.808 |
174
+ | Long context | ZeroSCROLLS QuALITY | `log_acc` | 0 | 0.667 | 0.810 | 0.714 | **4.565** | 4.230 | 4.215 |
175
+ | Long context | ZeroSCROLLS SpaceDigest | `ES` | 0 | 0.278 | **0.638** | 0.490 | **5.770** | 4.518 | 4.505 |
176
+ | Long context | ZeroSCROLLS SQuALITY | `rouge_gm` | 0 | 0.144 | **0.164** | 0.163 | **4.965** | 4.240 | 4.241 |
177
 
178
  # Training Details
179
 
 
185
 
186
  ## Encoder module
187
 
188
+ | | **119M** |
189
  | --- | --- |
190
  | Number of layers | 6 |
191
  | Number of attention heads | 8 |
 
195
  | Cross-attention hidden size | 4096 |
196
  | MLP expansion factor | 2.75 |
197
  | MLP type | SwiGLU |
198
+ | Sequence length | 163840 |
199
  | Position embeddings | RoPE with base 1e5 |
200
  | Attention type | causal, local with window size 768 |
201
  | QK-norm | disabled |
202
 
203
  ## Backbone module
204
 
205
+ | | **7B** |
206
  | --- | --- |
207
  | Number of layers | 32 |
208
  | Number of attention heads | 32 |
 
211
  | Hidden size | 4096 |
212
  | MLP expansion factor | 3.5 |
213
  | MLP type | SwiGLU |
214
+ | Sequence length | 20480 |
215
  | Position embeddings | RoPE with base 5e5 |
216
  | Attention type | causal |
217
  | QK-norm | per head |
218
 
219
  ## Decoder module
220
 
221
+ | | **94M** |
222
  | --- | --- |
223
  | Number of layers | 4 |
224
  | Number of attention heads | 8 |
 
228
  | Cross-attention hidden size | 4096 |
229
  | MLP expansion factor | 2.75 |
230
  | MLP type | SwiGLU |
231
+ | Sequence length | 163840 |
232
  | Position embeddings | RoPE with base 1e5 |
233
  | Attention type | causal, local with window size 768 |
234
  | QK-norm | disabled |
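
The three module tables above can be read as one nested configuration. The sketch below only transcribes the layer, head, and sequence-length rows for orientation; the field names are illustrative and are not the keys used by the released checkpoint:

```python
from dataclasses import dataclass

@dataclass
class ModuleConfig:
    layers: int
    attention_heads: int
    sequence_length: int

# Values transcribed from the tables above.
ENCODER = ModuleConfig(layers=6, attention_heads=8, sequence_length=163_840)    # ~119M params
BACKBONE = ModuleConfig(layers=32, attention_heads=32, sequence_length=20_480)  # ~7B params
DECODER = ModuleConfig(layers=4, attention_heads=8, sequence_length=163_840)    # ~94M params
```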
 
250
 
251
  To improve the processing of code and math documents, we made additional adjustments to the Unicode splitter. First, we split instances of camel cases like FooBar into Foo and Bar. Second, we treated math symbols (again by Unicode standard) as separate words.
252
 
253
+ ## Instruction Fine-tuning
254
 
255
+ ### Approach
256
 
257
+ We optimized `TFree-HAT-Pretrained-7B-Base` for instruction-following using a standard post-training pipeline. First, we applied supervised fine-tuning (SFT) to train the model on both single-turn and multi-turn (chat) instruction-following tasks. Next, we aligned our model for helpfulness and, in parts, safety using Direct Preference Optimization (DPO).
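
For readers unfamiliar with DPO, here is a minimal sketch of the standard DPO objective; the β value and implementation details are illustrative assumptions, not the training configuration that was actually used:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: increase the margin by which the policy prefers the
    chosen completion over the rejected one, relative to a frozen reference model."""
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()
```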
258
 
259
+ ### Data
260
 
261
+ The data used for instruction fine-tuning is based on a mixture of user prompts and model completions. The data mixture consists of roughly 2M samples from diverse datasets, including but not limited to: specialized reasoning datasets covering mathematics, programming, and logical inference; human feedback focused on helpful and harmless responses; a small curated set for specific response patterns; safety and robustness subsets for appropriate boundaries; collaborative conversational data; multilingual conversation prompts; tabular data reasoning for structured information; and formal mathematics with advanced problems.
262
 
263
+ We synthesized responses to the prompts using Qwen 2.5-32B and Qwen 2.5-72B. Additionally, we improved German performance by translating English prompts using Mistral-Nemo-Instruct-2407, generating the corresponding answers using Mistral-Small-3.1-Instruct, and performing quality filtering using an LLM judge based on Llama-3.3-70B-Instruct. Lastly, we supplemented the synthetic data with proprietary human-generated SFT data as well as further data sources.
264
 
265
+ For DPO training, we used a similar dataset of prompts and completions from diverse domains.
266
 
267
  ## Legal Compliance
268
 
 
272
 
273
  ## Compute & Training Efficiency
274
 
275
+ The following table shows the compute resources used in the training stages for the 7B models.
276
 
277
  | **Model** | **Training phase** | **GPUs** | **Approximate average power consumption per GPU** | **Approximate GPU hours** |
278
  | --- | --- | --- | --- | --- |
279
+ | 7B | Long context SFT | 128 x H100 | 160W | 1,500 |
280
+ | 7B | DPO | 128 x H100 | 160W | 1,300 |
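
As a back-of-the-envelope reading of this table, GPU energy is roughly average power multiplied by GPU hours; the estimate below ignores host systems, networking, and cooling overhead (see the Environmental Impact section below for the full accounting):

```python
# Rough GPU-only energy estimate from the table above:
# energy [kWh] = average power per GPU [W] x GPU hours / 1000
phases = {"Long context SFT": (160, 1_500), "DPO": (160, 1_300)}
for name, (watts, gpu_hours) in phases.items():
    print(f"{name}: ~{watts * gpu_hours / 1000:.0f} kWh")
# Long context SFT: ~240 kWh
# DPO: ~208 kWh
```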
 
281
 
282
  ## Environmental Impact
283
 
 
424
 
425
  This list of risks, biases, and limitations may not be complete, as improving the understanding and behavior of language models is an ongoing research topic in the AI science community.
426
 
427
+ # Legal Acknowledgements
428
+
429
+ - **Built with Llama**: Llama 3.1 is licensed under the Llama 3.1 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved. The applicable license agreement can be found under the following link: [Llama 3.1 Community License Agreement](https://www.llama.com/llama3_1/license/)
430
+
431
+ - **Improved using Qwen**
432
+
433
 
434
  \*Aleph Alpha Research refers to Aleph Alpha Research GmbH
435
 
436
+ [hat-paper]: https://arxiv.org/abs/2501.10322