Small text updates.
README.md CHANGED
@@ -1,5 +1,5 @@
 ---
-title: Pre-training Dutch T5 Models, evaluation and model lists
+title: Pre-training Dutch T5 and UL2 Models, evaluation and model lists
 emoji: π
 colorFrom: blue
 colorTo: pink
app.py CHANGED
@@ -320,10 +320,11 @@ mT5 green and the other models black.
 * For the translation task from English to Dutch, the Dutch+English pre-trained models perform well. Also
 `UL2 Dutch` pre-trained Dutch models are consistently better than their `Flan`, `T5 Dutch` and
 `mT5` counterparts of the comparable size.
-* Fine-tuning of `t5-v1.1-large-dutch-cased` failed with the fixed
+* Fine-tuning of `t5-v1.1-large-dutch-cased` failed with the hyperparameters that were fixed to the same value for the
+evaluation of every model.
 Since the `UL2` models are better across the board, I've disabled this model on the hub.
 * The `long-t5` models show bad performance on both tasks.
-I cannot explain this the translation task. With a sequence length of 128 input and output
+I cannot explain this, especially for the translation task. With a sequence length of 128 input and output
 tokens, the sliding attention window with radius length 127 of the `long-t5` models should be able to handle this.
 I've retried the fine-tuning of these models with
 `float32` instead of `bfloat16`, but the results were the same. Maybe this is normal behaviour for these models
@@ -388,10 +389,11 @@ mT5 green and the other models black.
 """## Miscellaneous remarks
 
 * Use loss regularization when training with `bfloat16` for better results (more info below).
-* Be cautious of the dropout rate in the config.json file
+* Be cautious of the dropout rate in the config.json file, as besides learning rate it is probably the most important
+hyperparameter.
+If you are evaluating different pre-trained models, be sure to fine-tune with dropout set equal.
 Check in a model's `config.json` what the dropout rate has been set to. Unless you
 intend to run many epochs on the same data, its worth to try a training run without dropout.
-If you want to compare losses, be sure to set the dropout rate equal.
 The smaller models can probably always be trained without.
 * Training with more layers is much slower than you'd expect from the increased model size.
 It is also more difficult to get batch size and learning rate right. Below is a section
@@ -628,7 +630,7 @@ I am grateful to the [https://huggingface.co/Finnish-NLP](Finnish-NLP) authors f
 definitions, and to [Stephenn Fernandes](https://huggingface.co/StephennFernandes) for his support in getting me started with the T5X framework.
 Lastly, I want to express my gratitude to Google for their openness and generosity in releasing T5X and related repositories.
 
-Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)
+Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/).
 Some of the sentences were reworded by ChatGPT.
 """
 )
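On the loss-regularization remark in the "Miscellaneous remarks" hunk: a minimal sketch, assuming the remark refers to a z-loss style auxiliary term as used in T5X-style training; the function name and the `z_loss` default below are illustrative assumptions, and the "(more info below)" section of the app remains the reference.

```python
import jax.numpy as jnp
from jax.nn import log_softmax
from jax.scipy.special import logsumexp

def cross_entropy_with_z_loss(logits, onehot_targets, z_loss=1e-4):
    # Standard token-level cross entropy.
    ce = -jnp.sum(onehot_targets * log_softmax(logits, axis=-1), axis=-1)
    # Penalize the log of the softmax normalizer so it stays near zero,
    # which helps keep the loss numerically stable when training in bfloat16.
    log_z = logsumexp(logits, axis=-1)
    return ce + z_loss * jnp.square(log_z)
```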
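On the dropout remark in the same hunk: a minimal sketch of checking the `dropout_rate` stored in a checkpoint's `config.json` and overriding it for a fine-tuning run, assuming the Hugging Face `transformers` API; the checkpoint name is only an example.

```python
from transformers import AutoConfig, T5ForConditionalGeneration

checkpoint = "yhavinga/t5-base-dutch"  # example checkpoint name (assumption)

# Inspect the dropout rate that ships in the checkpoint's config.json.
config = AutoConfig.from_pretrained(checkpoint)
print("dropout_rate from config.json:", config.dropout_rate)

# Override it for a fine-tuning run: disable dropout for short runs on large
# data, or pin it to one value across all models being compared.
model = T5ForConditionalGeneration.from_pretrained(checkpoint, dropout_rate=0.0)
print("dropout_rate used for fine-tuning:", model.config.dropout_rate)
```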