Update README.md
Browse files
README.md
CHANGED
|
@@ -3,16 +3,17 @@ language:
|
|
| 3 |
- en
|
| 4 |
- de
|
| 5 |
license: other
|
| 6 |
-
thumbnail: https://huggingface.co/Aleph-Alpha/Llama-TFree-HAT-Pretrained-
|
| 7 |
license_name: open-aleph-license
|
| 8 |
license_link: LICENSE
|
|
|
|
| 9 |
tags:
|
| 10 |
- Aleph Alpha Research
|
| 11 |
- pytorch
|
| 12 |
- Hierarchical Autoregressive Transformer
|
| 13 |
- HAT
|
| 14 |
model-index:
|
| 15 |
-
- name: TFree-HAT-Pretrained-
|
| 16 |
results: []
|
| 17 |
---
|
| 18 |
|
|
@@ -33,29 +34,31 @@ model-index:
|
|
| 33 |
<a href="https://twitter.com/Aleph__Alpha" target="_blank" style="margin: 2px;">
|
| 34 |
<img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-AlephAlpha_Research-white?logo=x&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
|
| 35 |
</a>
|
| 36 |
-
<a href="https://huggingface.co/Aleph-Alpha/TFree-HAT-Pretrained-
|
| 37 |
<img alt="License" src="https://img.shields.io/badge/License-Open Aleph License-white?&color=white" style="display: inline-block; vertical-align: middle;"/>
|
| 38 |
</a>
|
| 39 |
</div>
|
| 40 |
|
| 41 |
<hr>
|
| 42 |
|
| 43 |
-
# TFree-HAT-Pretrained-
|
| 44 |
<!-- markdownlint-disable first-line-h1 -->
|
| 45 |
<!-- markdownlint-disable html -->
|
| 46 |
<!-- markdownlint-disable no-duplicate-header -->
|
| 47 |
|
| 48 |
-
This model
|
|
|
|
|
|
|
| 49 |
|
| 50 |
The model is based on our Hierarchical Autoregressive Transformer (HAT) architecture which is described originally in our [paper](https://arxiv.org/abs/2501.10322). This novel architecture integrates character-level encoding and decoding with the word-level backbone, allowing for improved text compression (less sequence positions) and performance in the languages it has been trained on, and potentially higher robustness to prompt changes, as well as improved adaptability to new languages & domains via fine-tuning.
|
| 51 |
|
| 52 |
-
The model was
|
| 53 |
|
| 54 |
You can find model weights and their corresponding safetensors conversions at the following links:
|
| 55 |
|
| 56 |
| Model Name | Description |
|
| 57 |
| --- | --- |
|
| 58 |
-
| `TFree-HAT-Pretrained-
|
| 59 |
|
| 60 |
# Model Access
|
| 61 |
|
|
@@ -84,13 +87,13 @@ Download model weights and run inference using the following example:
|
|
| 84 |
import torch
|
| 85 |
from transformers import AutoModelForCausalLM
|
| 86 |
INPUT ="When was Rome founded?"
|
| 87 |
-
MODEL_ID = "Aleph-Alpha/TFree-HAT-Pretrained-
|
| 88 |
model = AutoModelForCausalLM.from_pretrained(
|
| 89 |
trust_remote_code=True,
|
| 90 |
pretrained_model_name_or_path=MODEL_ID,
|
| 91 |
attn_implementation="flash_attention_2",
|
| 92 |
).to("cuda", torch.bfloat16)
|
| 93 |
-
input_ids, cumulative_word_lengths = model._prepare_input(INPUT)
|
| 94 |
model_output = model.generate(
|
| 95 |
input_ids,
|
| 96 |
cumulative_seq_lengths_per_word=cumulative_word_lengths,
|
|
@@ -101,12 +104,16 @@ print("Prompt: ", INPUT)
|
|
| 101 |
print("Completion: ", model_output.completion_text)
|
| 102 |
```
|
| 103 |
|
| 104 |
-
Please note that the realized inference speed strongly depends on the maturity of the inference implementation beyond the intrinsic text compression of any model. Besides this huggingface transformers-based inference solution, we are also releasing a [vLLM-based inference solution](https://github.com/Aleph-Alpha/vllm) for our models that is optimized for batched inference. Please
|
|
|
|
|
|
|
|
|
|
| 105 |
|
|
|
|
| 106 |
|
| 107 |
# Evaluation
|
| 108 |
|
| 109 |
-
**Performance**: Our T-Free models deliver performance on par with
|
| 110 |
|
| 111 |
**Efficiency**: Our tokenizer-free approach results in improved text compression, providing a foundation for improved efficiency in inference speed. We measure in terms of words processed across all languages and domains. We define the metric as **tokenizer fertility** or **bytes per sequence position**, where a higher value indicates better performance. Latency and throughput are currently out of scope for research-centric evaluations and will be addressed in the future. Currently, our evaluation framework automatically measures **bytes per sequence position** across datasets, allowing us to derive text compression scores and analyze variations across different dataset distributions. The end to end resulting efficiency is depends on the inference implementation beyond the scope of the here provided inference implementation and reported compression scores.
|
| 112 |
|
|
@@ -127,25 +134,46 @@ Please note that the realized inference speed strongly depends on the maturity o
|
|
| 127 |
`CI`: Concordance Index<br>
|
| 128 |
`ES`: Exponential Similarity
|
| 129 |
|
| 130 |
-
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
|
| 135 |
-
|
| 136 |
-
|
| 137 |
-
|
|
| 138 |
-
|
|
| 139 |
-
|
|
| 140 |
-
|
| 141 |
-
|
|
| 142 |
-
|
|
| 143 |
-
|
|
| 144 |
-
|
|
| 145 |
-
|
|
| 146 |
-
|
|
| 147 |
-
|
|
| 148 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 149 |
|
| 150 |
# Training Details
|
| 151 |
|
|
@@ -157,7 +185,7 @@ The encoder processes input text as a sequence of UTF-8 bytes and produces a seq
|
|
| 157 |
|
| 158 |
## Encoder module
|
| 159 |
|
| 160 |
-
| | **
|
| 161 |
| --- | --- |
|
| 162 |
| Number of layers | 6 |
|
| 163 |
| Number of attention heads | 8 |
|
|
@@ -167,14 +195,14 @@ The encoder processes input text as a sequence of UTF-8 bytes and produces a seq
|
|
| 167 |
| Cross-attention hidden size | 4096 |
|
| 168 |
| MLP expansion factor | 2.75 |
|
| 169 |
| MLP type | SwiGLU |
|
| 170 |
-
| Sequence length |
|
| 171 |
| Position embeddings | RoPE with base 1e5 |
|
| 172 |
| Attention type | causal, local with window size 768 |
|
| 173 |
| QK-norm | disabled |
|
| 174 |
|
| 175 |
## Backbone module
|
| 176 |
|
| 177 |
-
| | **
|
| 178 |
| --- | --- |
|
| 179 |
| Number of layers | 32 |
|
| 180 |
| Number of attention heads | 32 |
|
|
@@ -183,14 +211,14 @@ The encoder processes input text as a sequence of UTF-8 bytes and produces a seq
|
|
| 183 |
| Hidden size | 4096 |
|
| 184 |
| MLP expansion factor | 3.5 |
|
| 185 |
| MLP type | SwiGLU |
|
| 186 |
-
| Sequence length |
|
| 187 |
| Position embeddings | RoPE with base 5e5 |
|
| 188 |
| Attention type | causal |
|
| 189 |
| QK-norm | per head |
|
| 190 |
|
| 191 |
## Decoder module
|
| 192 |
|
| 193 |
-
| | **
|
| 194 |
| --- | --- |
|
| 195 |
| Number of layers | 4 |
|
| 196 |
| Number of attention heads | 8 |
|
|
@@ -200,7 +228,7 @@ The encoder processes input text as a sequence of UTF-8 bytes and produces a seq
|
|
| 200 |
| Cross-attention hidden size | 4096 |
|
| 201 |
| MLP expansion factor | 2.75 |
|
| 202 |
| MLP type | SwiGLU |
|
| 203 |
-
| Sequence length |
|
| 204 |
| Position embeddings | RoPE with base 1e5 |
|
| 205 |
| Attention type | causal, local with window size 768 |
|
| 206 |
| QK-norm | disabled |
|
|
@@ -222,64 +250,19 @@ We also merged leading whitespace and trailing punctuation into the words to red
|
|
| 222 |
|
| 223 |
To improve the processing of code and math documents, we made additional adjustments to the Unicode splitter. First, we split instances of camel cases like FooBar into Foo and Bar. Second, we treated math symbols (again by Unicode standard) as separate words.
|
| 224 |
|
| 225 |
-
##
|
| 226 |
-
|
| 227 |
-
**Approach**
|
| 228 |
-
|
| 229 |
-
We randomly initialized all model parameters. The model was then trained on the next-byte-prediction objective on a large and diverse document corpus (see below). Initially, we trained on sequences up to 3500 words for a total amount of nearly 4T words. We used global batch-size of 1024 (2.5M words) and followed a warmup-stable-decay schedule with a warmup of 5000 steps, a phase of stable learning rate 2e-3 for 945000 steps and inverse-square-root cooldown to learning rate 0 over the last 50000 steps. We employed weight decay of 0.05 for all parameters except for the embedding and normalization parameters. We employed QK-norm per head and attention logit softcapping at 100, which we found to be important for training stability during pretraining.
|
| 230 |
-
|
| 231 |
-
We then continued training on sequences of up to 32900 words for another 2500 steps with global batch size 128, totaling to 10.5B words, upweighting longer documents to make use of the extended context. We used warmup-stable-decay learning rate schedule with 500 steps warmup, a phase of stable learning 2e-4, and a final decay to 0 over the last 500 steps. We disabled attention logit softcapping during this long-context adaptation such that it is not required during inference.
|
| 232 |
-
|
| 233 |
-
The training was conducted in our [Scaling framework](https://github.com/Aleph-Alpha/scaling).
|
| 234 |
-
|
| 235 |
-
**Data sources**
|
| 236 |
-
|
| 237 |
-
The model was trained on a filtered subset of diverse corpora of text data including proprietary curated datasets, high-quality web content, public domain sources, German texts, mathematical texts, and programming code. The proportions and sources of data we used in the pre-training were:
|
| 238 |
-
|
| 239 |
-
English Language Data (70%)
|
| 240 |
-
|
| 241 |
-
- curated web and synthetic data (63%)
|
| 242 |
-
|
| 243 |
-
- high quality curated sources such as Wikipedia and public domain books (7%)
|
| 244 |
|
| 245 |
-
|
| 246 |
|
| 247 |
-
-
|
| 248 |
|
| 249 |
-
|
| 250 |
|
| 251 |
-
|
| 252 |
|
| 253 |
-
-
|
| 254 |
-
|
| 255 |
-
- mathematical word problems and equations (3%)
|
| 256 |
-
|
| 257 |
-
Programming Code (18%)
|
| 258 |
-
|
| 259 |
-
- general programming code (11%)
|
| 260 |
-
|
| 261 |
-
- high-quality and synthetic Python code (7%)
|
| 262 |
-
|
| 263 |
-
## Data curation
|
| 264 |
-
|
| 265 |
-
We applied a range of curation techniques, e.g., for German as described in [Aleph-Alpha-GermanWeb](https://huggingface.co/datasets/Aleph-Alpha/Aleph-Alpha-GermanWeb). These include but are not limited to:
|
| 266 |
-
|
| 267 |
-
- URL filtering. We used a URL filter developed to filter out fraudulent, harmful, and illegal content from an explicit blocklist, e.g., adult websites, or URLs containing words associated with fraudulent, harmful, or adult content.
|
| 268 |
-
|
| 269 |
-
- Text extraction. Natural language texts which were embedded HTML and other web programming languages were extracted using the [Resiliparse](https://github.com/chatnoir-eu/chatnoir-resiliparse) text extractor.
|
| 270 |
-
|
| 271 |
-
- Language identification. We used a [fastText language classifier](https://fasttext.cc/docs/en/language-identification.html) trained on character n-grams from Wikipedia to identify, retain, and sort texts into English and German.
|
| 272 |
-
|
| 273 |
-
- Repetition removal. We applied heuristic methods for detection and removal of repetitions on the line, paragraph, and character level.
|
| 274 |
-
|
| 275 |
-
- Document- and line-level filtering. We utilized additional document-level heuristics to ensure documents had reasonable numbers and quality of words, naturalistic symbols-to-words and numbers-to-words ratios, not predominantly made up of bullet points, and a sufficient quantity of real words.
|
| 276 |
-
|
| 277 |
-
- Deduplication. Using exact and fuzzy deduplication to remove duplicate documents.
|
| 278 |
-
|
| 279 |
-
## Synthetic data
|
| 280 |
-
|
| 281 |
-
We also generated synthetic data by using permissively-licensed LLMs.
|
| 282 |
|
|
|
|
| 283 |
|
| 284 |
## Legal Compliance
|
| 285 |
|
|
@@ -289,13 +272,12 @@ We acknowledge and abide by applicable national and international regulations, i
|
|
| 289 |
|
| 290 |
## Compute & Training Efficiency
|
| 291 |
|
| 292 |
-
The following table shows the compute resources used in the training stages for the
|
| 293 |
|
| 294 |
| **Model** | **Training phase** | **GPUs** | **Approximate average power consumption per GPU** | **Approximate GPU hours** |
|
| 295 |
| --- | --- | --- | --- | --- |
|
| 296 |
-
|
|
| 297 |
-
|
|
| 298 |
-
| 8B | Long context adaptation | 256 x H100 | 190W | 5,328 |
|
| 299 |
|
| 300 |
## Environmental Impact
|
| 301 |
|
|
@@ -442,7 +424,13 @@ Some inference parameters, e.g., temperature, lead to the random sampling of out
|
|
| 442 |
|
| 443 |
This list of risks, biases, and limitations may not be complete, as improving the understanding and behavior of language models is an ongoing research topic in the AI science community.
|
| 444 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 445 |
|
| 446 |
\*Aleph Alpha Research refers to Aleph Alpha Research GmbH
|
| 447 |
|
| 448 |
-
[hat-paper]: https://arxiv.org/abs/2501.10322
|
|
|
|
| 3 |
- en
|
| 4 |
- de
|
| 5 |
license: other
|
| 6 |
+
thumbnail: https://huggingface.co/Aleph-Alpha/Llama-TFree-HAT-Pretrained-7B-DPO/raw/main/source/aleph_alpha_logo_thumbnail.png
|
| 7 |
license_name: open-aleph-license
|
| 8 |
license_link: LICENSE
|
| 9 |
+
base_model: Aleph-Alpha/TFree-HAT-Pretrained-7B-Base
|
| 10 |
tags:
|
| 11 |
- Aleph Alpha Research
|
| 12 |
- pytorch
|
| 13 |
- Hierarchical Autoregressive Transformer
|
| 14 |
- HAT
|
| 15 |
model-index:
|
| 16 |
+
- name: Llama-TFree-HAT-Pretrained-7B-DPO
|
| 17 |
results: []
|
| 18 |
---
|
| 19 |
|
|
|
|
| 34 |
<a href="https://twitter.com/Aleph__Alpha" target="_blank" style="margin: 2px;">
|
| 35 |
<img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-AlephAlpha_Research-white?logo=x&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
|
| 36 |
</a>
|
| 37 |
+
<a href="https://huggingface.co/Aleph-Alpha/Llama-TFree-HAT-Pretrained-7B-DPO/blob/main/LICENSE" style="margin: 2px;">
|
| 38 |
<img alt="License" src="https://img.shields.io/badge/License-Open Aleph License-white?&color=white" style="display: inline-block; vertical-align: middle;"/>
|
| 39 |
</a>
|
| 40 |
</div>
|
| 41 |
|
| 42 |
<hr>
|
| 43 |
|
| 44 |
+
# Llama-TFree-HAT-Pretrained-7B-DPO
|
| 45 |
<!-- markdownlint-disable first-line-h1 -->
|
| 46 |
<!-- markdownlint-disable html -->
|
| 47 |
<!-- markdownlint-disable no-duplicate-header -->
|
| 48 |
|
| 49 |
+
**NOTE: This model has been pretrained from scratch and finetuned making use of Llama 3.3 for filtering. Adhering to the Llama license, we therefore name the model starting with the llama prefix**
|
| 50 |
+
|
| 51 |
+
This model card provides an overview of our **Llama-TFree-HAT-Pretrained-7B-DPO** model, which is a tokenizer-free (TFree) foundation model developed by Aleph Alpha Research* and publicly available under the Open Aleph License, a license explicitly allowing for non-commercial research and educational use.
|
| 52 |
|
| 53 |
The model is based on our Hierarchical Autoregressive Transformer (HAT) architecture which is described originally in our [paper](https://arxiv.org/abs/2501.10322). This novel architecture integrates character-level encoding and decoding with the word-level backbone, allowing for improved text compression (less sequence positions) and performance in the languages it has been trained on, and potentially higher robustness to prompt changes, as well as improved adaptability to new languages & domains via fine-tuning.
|
| 54 |
|
| 55 |
+
The model was initialized from [`TFree-HAT-Pretrained-7B-Base`](https://huggingface.co/Aleph-Alpha/TFree-HAT-Pretrained-7B-Base) and post-trained and direct-preference-optimized in English & German on carefully curated data in compliance with applicable EU and national regulations, including copyright and data privacy laws. It shows strong proficiency in German, while also beating Llama 3.1 on many benchmarks in English. The direct-preference-optimization of [Llama-TFree-HAT-Pretrained-7B-DPO](https://huggingface.co/Aleph-Alpha/Llama-TFree-HAT-Pretrained-7B-DPO) prioritizes helpfulness and instruction following, making the model suitable for sensitive applications without the risk of over-refusal. The model has not been optimized for code generation and math and are thus not evaluated extensively on respective benchmarks.
|
| 56 |
|
| 57 |
You can find model weights and their corresponding safetensors conversions at the following links:
|
| 58 |
|
| 59 |
| Model Name | Description |
|
| 60 |
| --- | --- |
|
| 61 |
+
| `Llama-TFree-HAT-Pretrained-7B-DPO` | [Link](https://huggingface.co/Aleph-Alpha/Llama-TFree-HAT-Pretrained-7B-DPO) - is a is a supervised fine-tuned and direct-preference-optimized `TFree-HAT-Pretrained-7B-Base` |
|
| 62 |
|
| 63 |
# Model Access
|
| 64 |
|
|
|
|
| 87 |
import torch
|
| 88 |
from transformers import AutoModelForCausalLM
|
| 89 |
INPUT ="When was Rome founded?"
|
| 90 |
+
MODEL_ID = "Aleph-Alpha/Llama-TFree-HAT-Pretrained-7B-DPO"
|
| 91 |
model = AutoModelForCausalLM.from_pretrained(
|
| 92 |
trust_remote_code=True,
|
| 93 |
pretrained_model_name_or_path=MODEL_ID,
|
| 94 |
attn_implementation="flash_attention_2",
|
| 95 |
).to("cuda", torch.bfloat16)
|
| 96 |
+
input_ids, cumulative_word_lengths = model._prepare_input(INPUT, add_llama_template=True)
|
| 97 |
model_output = model.generate(
|
| 98 |
input_ids,
|
| 99 |
cumulative_seq_lengths_per_word=cumulative_word_lengths,
|
|
|
|
| 104 |
print("Completion: ", model_output.completion_text)
|
| 105 |
```
|
| 106 |
|
| 107 |
+
Please note that the realized inference speed strongly depends on the maturity of the inference implementation beyond the intrinsic text compression of any model. Besides this huggingface transformers-based inference solution, we are also releasing a [vLLM-based inference solution](https://github.com/Aleph-Alpha/vllm) for our models that is optimized for batched inference. Please not that this vLLM inference for HAT is still under active development.
|
| 108 |
+
|
| 109 |
+
|
| 110 |
+
## Prompt formatting
|
| 111 |
|
| 112 |
+
The prompt format used for our post-trained model is identical to the [Llama prompt format](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/). We highly recommend using it when prompting the models to ensure optimal performance for the direct-preference-optimized model versions. You can format your prompt in the recommended format by setting `add_llama_template=True` in the `model._prepare_input` method.
|
| 113 |
|
| 114 |
# Evaluation
|
| 115 |
|
| 116 |
+
**Performance**: Our T-Free models deliver performance on par with current state-of-the-art OS memory-equivalent models in both English and German. For evaluation purposes, we compare our DPO model with [Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and [Tulu 3.1 8B](https://huggingface.co/allenai/Llama-3.1-Tulu-3.1-8B). Respective benchmarks and results can be found in the tables below.
|
| 117 |
|
| 118 |
**Efficiency**: Our tokenizer-free approach results in improved text compression, providing a foundation for improved efficiency in inference speed. We measure in terms of words processed across all languages and domains. We define the metric as **tokenizer fertility** or **bytes per sequence position**, where a higher value indicates better performance. Latency and throughput are currently out of scope for research-centric evaluations and will be addressed in the future. Currently, our evaluation framework automatically measures **bytes per sequence position** across datasets, allowing us to derive text compression scores and analyze variations across different dataset distributions. The end to end resulting efficiency is depends on the inference implementation beyond the scope of the here provided inference implementation and reported compression scores.
|
| 119 |
|
|
|
|
| 134 |
`CI`: Concordance Index<br>
|
| 135 |
`ES`: Exponential Similarity
|
| 136 |
|
| 137 |
+
|
| 138 |
+
## DPO (Post-Training) Benchmarks
|
| 139 |
+
|
| 140 |
+
**MTBench winrates**
|
| 141 |
+
|
| 142 |
+
English/German MTBench numbers are based on datasets created with [FastChat](https://github.com/LumiOpen/FastChat) for the corresponding models.
|
| 143 |
+
|
| 144 |
+
| | **vs.** [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) **(Eng)** | **vs.** [Llama-3.1-Tulu-3-8B-DPO](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-DPO) **(Eng)** | **vs.** [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) **(Ger)** | **vs.** [Llama-3.1-Tulu-3-8B-DPO](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-DPO) **(Ger)** |
|
| 145 |
+
| --- | --- | --- | --- | --- |
|
| 146 |
+
| Llama-TFree-HAT-Pretrained-7B-DPO | 0.687 | 0.677 | 0.750 | 0.658 |
|
| 147 |
+
|
| 148 |
+
| Group | Task | Metric Name | Num Fewshot | [Llama-TFree-HAT-Pretrained-7B-DPO]((https://huggingface.co/Aleph-Alpha/Llama-TFree-HAT-Pretrained-7B-DPO)) | [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | [Llama-3.1-Tulu-3.1-8B](https://huggingface.co/allenai/Llama-3.1-Tulu-3.1-8B) | [Llama-TFree-HAT-Pretrained-7B-DPO]((https://huggingface.co/Aleph-Alpha/Llama-TFree-HAT-Pretrained-7B-DPO)) Compression | [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) Compression | [Llama-3.1-Tulu-3.1-8B](https://huggingface.co/allenai/Llama-3.1-Tulu-3.1-8B) Compression |
|
| 149 |
+
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|
| 150 |
+
| Knowledge | MMLU | `norm_log_acc` | 5 | 0.654 | **0.681** | 0.664 | **5.818** | 4.885 | 4.153 |
|
| 151 |
+
| Knowledge | Full Text MMLU | `norm_log_acc` | 5 | 0.658 | **0.680** | 0.677 | **5.849** | 5.075 | 4.408 |
|
| 152 |
+
| Knowledge | MMLU Pro | `norm_log_acc` | 5 | 0.376 | **0.402** | 0.322 | **5.135** | 4.077 | 4.077 |
|
| 153 |
+
| Knowledge | GPQA | `log_acc` | 0 | 0.299 | **0.306** | 0.271 | **5.260** | 3.771 | 3.408 |
|
| 154 |
+
| Knowledge | BBH | `norm_log_acc` | 3 | 0.490 | **0.522** | 0.494 | **5.332** | 4.374 | 3.668 |
|
| 155 |
+
| Knowledge | OpenBookQA | `norm_log_acc` | 10 | 0.418 | 0.526 | **0.528** | **7.101** | 6.973 | 4.041 |
|
| 156 |
+
| Knowledge | TruthfulQA | `norm_prob_mass` | 6 | **0.429** | 0.171 | 0.173 | **6.607** | 5.553 | 3.807 |
|
| 157 |
+
| Reasoning | ARC Easy | `norm_log_acc` | 25 | **0.892** | 0.875 | 0.873 | **7.018** | 6.396 | 4.497 |
|
| 158 |
+
| Reasoning | ARC Challenge | `norm_log_acc` | 25 | **0.655** | 0.638 | 0.650 | **6.860** | 6.218 | 4.522 |
|
| 159 |
+
| Reasoning | Winogrande | `norm_log_acc` | 5 | **0.713** | 0.657 | 0.683 | **6.856** | 6.517 | 4.116 |
|
| 160 |
+
| Reasoning | HellaSwag | `norm_log_acc` | 10 | 0.608 | 0.776 | **0.807** | **5.980** | 5.274 | 4.427 |
|
| 161 |
+
| German | MMMLU | `norm_log_acc` | 5 | **0.610** | 0.590 | 0.572 | **6.630** | 3.912 | 3.383 |
|
| 162 |
+
| German | [ARC Easy DE](https://huggingface.co/datasets/openGPT-X/arcx) | `norm_log_acc` | 25 | **0.823** | 0.729 | 0.751 | **7.872** | 4.910 | 3.607 |
|
| 163 |
+
| German | [ARC Challenge DE](https://huggingface.co/datasets/openGPT-X/arcx) | `norm_log_acc` | 25 | **0.599** | 0.503 | 0.525 | **7.798** | 4.862 | 3.610 |
|
| 164 |
+
| German | [Winogrande DE](https://huggingface.co/datasets/demelin/wino_x) | `norm_log_acc` | 5 | **0.799** | 0.729 | 0.711 | **7.225** | 5.310 | 3.391 |
|
| 165 |
+
| German | [HellaSwag DE](https://huggingface.co/datasets/openGPT-X/hellaswagx) | `norm_log_acc` | 10 | 0.535 | 0.626 | **0.657** | **6.971** | 4.137 | 3.603 |
|
| 166 |
+
| German | [TruthfulQA DE](https://huggingface.co/datasets/openGPT-X/truthfulqax) | `norm_prob_mass` | 6 | **0.420** | 0.168 | 0.171 | **7.394** | 4.581 | 3.276 |
|
| 167 |
+
| German | [GSM8K DE](https://huggingface.co/datasets/openGPT-X/gsm8kx) | `comp_acc` | 8 | 0.574 | 0.201 | **0.724** | **4.84** | 3.320 | 2.963 |
|
| 168 |
+
| German | WMT16 | `bleu` | 3 | 31.205 | 34.224 | 32.912 | **6.811** | 5.061 | 4.000 |
|
| 169 |
+
| German | WMT16 Instruct | `bleu` | 3 | 31.408 | **34.260** | 33.089 | **6.863** | 5.130 | 4.063 |
|
| 170 |
+
| Math | GSM8K | `comp_acc` | 8 | 0.711 | 0.757 | **0.870** | **4.324** | 3.794 | 3.356 |
|
| 171 |
+
| Long context | QuALITY | `log_acc` | 0 | 0.376 | 0.412 | **0.425** | **4.867** | 4.290 | 4.274 |
|
| 172 |
+
| Long context | ZeroSCROLLS MuSiQue | `F1` | 0 | 0.238 | 0.200 | 0.145 | **5.636** | 4.427 | 4.387 |
|
| 173 |
+
| Long context | ZeroSCROLLS Qasper | `F1` | 0 | 0.228 | **0.235** | 0.221 | **5.934** | 4.826 | 4.808 |
|
| 174 |
+
| Long context | ZeroSCROLLS QuALITY | `log_acc` | 0 | 0.667 | 0.810 | 0.714 | **4.565** | 4.230 | 4.215 |
|
| 175 |
+
| Long context | ZeroSCROLLS SpaceDigest | `ES` | 0 | 0.278 | **0.638** | 0.490 | **5.770** | 4.518 | 4.505 |
|
| 176 |
+
| Long context | ZeroSCROLLS SQuALITY | `rouge_gm` | 0 | 0.144 | **0.164** | 0.163 | **4.965** | 4.240 | 4.241 |
|
| 177 |
|
| 178 |
# Training Details
|
| 179 |
|
|
|
|
| 185 |
|
| 186 |
## Encoder module
|
| 187 |
|
| 188 |
+
| | **119M** |
|
| 189 |
| --- | --- |
|
| 190 |
| Number of layers | 6 |
|
| 191 |
| Number of attention heads | 8 |
|
|
|
|
| 195 |
| Cross-attention hidden size | 4096 |
|
| 196 |
| MLP expansion factor | 2.75 |
|
| 197 |
| MLP type | SwiGLU |
|
| 198 |
+
| Sequence length | 163840 |
|
| 199 |
| Position embeddings | RoPE with base 1e5 |
|
| 200 |
| Attention type | causal, local with window size 768 |
|
| 201 |
| QK-norm | disabled |
|
| 202 |
|
| 203 |
## Backbone module
|
| 204 |
|
| 205 |
+
| | **7B** |
|
| 206 |
| --- | --- |
|
| 207 |
| Number of layers | 32 |
|
| 208 |
| Number of attention heads | 32 |
|
|
|
|
| 211 |
| Hidden size | 4096 |
|
| 212 |
| MLP expansion factor | 3.5 |
|
| 213 |
| MLP type | SwiGLU |
|
| 214 |
+
| Sequence length | 20480 |
|
| 215 |
| Position embeddings | RoPE with base 5e5 |
|
| 216 |
| Attention type | causal |
|
| 217 |
| QK-norm | per head |
|
| 218 |
|
| 219 |
## Decoder module
|
| 220 |
|
| 221 |
+
| | **94M** |
|
| 222 |
| --- | --- |
|
| 223 |
| Number of layers | 4 |
|
| 224 |
| Number of attention heads | 8 |
|
|
|
|
| 228 |
| Cross-attention hidden size | 4096 |
|
| 229 |
| MLP expansion factor | 2.75 |
|
| 230 |
| MLP type | SwiGLU |
|
| 231 |
+
| Sequence length | 163840 |
|
| 232 |
| Position embeddings | RoPE with base 1e5 |
|
| 233 |
| Attention type | causal, local with window size 768 |
|
| 234 |
| QK-norm | disabled |
|
|
|
|
| 250 |
|
| 251 |
To improve the processing of code and math documents, we made additional adjustments to the Unicode splitter. First, we split instances of camel cases like FooBar into Foo and Bar. Second, we treated math symbols (again by Unicode standard) as separate words.
|
| 252 |
|
| 253 |
+
## Instruction Fine-tuning
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 254 |
|
| 255 |
+
### Approach
|
| 256 |
|
| 257 |
+
We optimized `TFree-HAT-Pretrained-7B-Base` for instruction-following using a standard post-training pipeline. First, we applied supervised fine-tuning (SFT) to train the model on both single-turn and multi-turn (chat) instruction-following tasks. Next, we aligned our model for helpfulness and, in parts, safety using Direct Preference Optimization (DPO).
|
| 258 |
|
| 259 |
+
### Data
|
| 260 |
|
| 261 |
+
The data used for instruction fine-tuning is based on a mixture of user prompts and model competitions. The data mixture consists of roughly 2M samples from diverse datasets including but not limited to: specialized reasoning datasets covering mathematics, programming, and logical inference; human feedback focused on helpful and harmless responses; a small curated set for specific response patterns; safety and robustness subsets for appropriate boundaries; collaborative conversational data; multilingual conversation prompts; tabular data reasoning for structured information; and formal mathematics with advanced problems.
|
| 262 |
|
| 263 |
+
We synthesized responses to the prompts using Qwen 2.5-32B and Qwen 2.5-72B. Additionally, we improved German performance by translating English prompts using Mistral-Nemo-Instruct-2407, generating the corresponding answers using Mistral-Small-3.1-Instruct, and performing quality filtering using an LLM judge based on Llama-3.3-70B-Instruct. Lastly, we supplemented the synthetic data with proprietary human-generated SFT data as well as further data sources.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 264 |
|
| 265 |
+
For DPO training, we used a similar dataset of prompts and completions from diverse domains.
|
| 266 |
|
| 267 |
## Legal Compliance
|
| 268 |
|
|
|
|
| 272 |
|
| 273 |
## Compute & Training Efficiency
|
| 274 |
|
| 275 |
+
The following table shows the compute resources used in the training stages for the 7B models.
|
| 276 |
|
| 277 |
| **Model** | **Training phase** | **GPUs** | **Approximate average power consumption per GPU** | **Approximate GPU hours** |
|
| 278 |
| --- | --- | --- | --- | --- |
|
| 279 |
+
| 7B | Long context SFT | 128 x H100 | 160W | 1,500 |
|
| 280 |
+
| 7B | DPO | 128 x H100 | 160W | 1,300 |
|
|
|
|
| 281 |
|
| 282 |
## Environmental Impact
|
| 283 |
|
|
|
|
| 424 |
|
| 425 |
This list of risks, biases, and limitations may not be complete, as improving the understanding and behavior of language models is an ongoing research topic in the AI science community.
|
| 426 |
|
| 427 |
+
# Legal Acknowledgements
|
| 428 |
+
|
| 429 |
+
- **Built with Llama**: Built with Llama: Llama 3.1 is licensed under the Llama 3.1 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved. The applicable license agreement can be found under the following link: [Llama 3.1 Community License Agreement ](https://www.llama.com/llama3_1/license/)
|
| 430 |
+
|
| 431 |
+
- **Improved using Qwen**
|
| 432 |
+
|
| 433 |
|
| 434 |
\*Aleph Alpha Research refers to Aleph Alpha Research GmbH
|
| 435 |
|
| 436 |
+
[hat-paper]: https://arxiv.org/abs/2501.10322
|