Update README.md
README.md CHANGED
@@ -77,7 +77,10 @@ The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total)
 of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer
 used is Adam with a learning rate of 1e-4, \\(\beta_{1} = 0.9\\) and \\(\beta_{2} = 0.999\\), a weight decay of 0.01,
 learning rate warmup for 10,000 steps and linear decay of the learning rate after.
-
+
+### Fine-tuning
+
+After pre-training, this model was fine-tuned on the SQuAD dataset with one of our fine-tuning scripts. In order to reproduce the training, you may use the following command:
 ```
 python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_qa.py \
 --model_name_or_path bert-large-uncased-whole-word-masking \