Small text updates.
README.md CHANGED
@@ -1,5 +1,5 @@
 ---
-title: Pre-training Dutch T5 Models, evaluation and model lists
+title: Pre-training Dutch T5 and UL2 Models, evaluation and model lists
 emoji: π
 colorFrom: blue
 colorTo: pink
app.py CHANGED
@@ -320,10 +320,11 @@ mT5 green and the other models black.
 * For the translation task from English to Dutch, the Dutch+English pre-trained models perform well. Also
 `UL2 Dutch` pre-trained Dutch models are consistently better than their `Flan`, `T5 Dutch` and
 `mT5` counterparts of the comparable size.
-* Fine-tuning of `t5-v1.1-large-dutch-cased` failed with the fixed
+* Fine-tuning of `t5-v1.1-large-dutch-cased` failed with the hyperparameters that were fixed to the same value for the
+evaluation of every model.
 Since the `UL2` models are better across the board, I've disabled this model on the hub.
 * The `long-t5` models show bad performance on both tasks.
-I cannot explain this the translation task. With a sequence length of 128 input and output
+I cannot explain this, especially for the translation task. With a sequence length of 128 input and output
 tokens, the sliding attention window with radius length 127 of the `long-t5` models should be able to handle this.
 I've retried the fine-tuning of these models with
 `float32` instead of `bfloat16`, but the results were the same. Maybe this is normal behaviour for these models
@@ -388,10 +389,11 @@ mT5 green and the other models black.
 """## Miscellaneous remarks
 
 * Use loss regularization when training with `bfloat16` for better results (more info below).
-* Be cautious of the dropout rate in the config.json file
+* Be cautious of the dropout rate in the config.json file, as besides learning rate it is probably the most important
+hyperparameter.
+If you are evaluating different pre-trained models, be sure to fine-tune with dropout set equal.
 Check in a model's `config.json` what the dropout rate has been set to. Unless you
 intend to run many epochs on the same data, its worth to try a training run without dropout.
-If you want to compare losses, be sure to set the dropout rate equal.
 The smaller models can probably always be trained without.
 * Training with more layers is much slower than you'd expect from the increased model size.
 It is also more difficult to get batch size and learning rate right. Below is a section
@@ -628,7 +630,7 @@ I am grateful to the [https://huggingface.co/Finnish-NLP](Finnish-NLP) authors f
 definitions, and to [Stephenn Fernandes](https://huggingface.co/StephennFernandes) for his support in getting me started with the T5X framework.
 Lastly, I want to express my gratitude to Google for their openness and generosity in releasing T5X and related repositories.
 
-Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)
+Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/).
 Some of the sentences were reworded by ChatGPT.
 """
 )
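On the loss-regularization remark in the "Miscellaneous remarks" hunk: a minimal sketch, assuming the remark refers to a z-loss style auxiliary term as used in T5X-style training; the function name and the `z_loss` default below are illustrative assumptions, and the "(more info below)" section of the app remains the reference.

```python
import jax.numpy as jnp
from jax.nn import log_softmax
from jax.scipy.special import logsumexp

def cross_entropy_with_z_loss(logits, onehot_targets, z_loss=1e-4):
    # Standard token-level cross entropy.
    ce = -jnp.sum(onehot_targets * log_softmax(logits, axis=-1), axis=-1)
    # Penalize the log of the softmax normalizer so it stays near zero,
    # which helps keep the loss numerically stable when training in bfloat16.
    log_z = logsumexp(logits, axis=-1)
    return ce + z_loss * jnp.square(log_z)
```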
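On the dropout remark in the same hunk: a minimal sketch of checking the `dropout_rate` stored in a checkpoint's `config.json` and overriding it for a fine-tuning run, assuming the Hugging Face `transformers` API; the checkpoint name is only an example.

```python
from transformers import AutoConfig, T5ForConditionalGeneration

checkpoint = "yhavinga/t5-base-dutch"  # example checkpoint name (assumption)

# Inspect the dropout rate that ships in the checkpoint's config.json.
config = AutoConfig.from_pretrained(checkpoint)
print("dropout_rate from config.json:", config.dropout_rate)

# Override it for a fine-tuning run: disable dropout for short runs on large
# data, or pin it to one value across all models being compared.
model = T5ForConditionalGeneration.from_pretrained(checkpoint, dropout_rate=0.0)
print("dropout_rate used for fine-tuning:", model.config.dropout_rate)
```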