---
license: gpl-3.0
language:
- nl
base_model:
- CLTL/MedRoBERTa.nl
tags:
- medical
- healthcare
metrics:
- perplexity
library_name: transformers
---

Continued, off-premise, pre-training of [MedRoBERTa.nl](https://huggingface.co/CLTL/MedRoBERTa.nl) using about 50GB of open Dutch and translated English corpora, followed by on-premise pre-training on 5GB of electronic health records mixed with 2GB of the public corpora.
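
Once published, the model can be used with the `transformers` library like any RoBERTa-style masked language model. The sketch below is a minimal usage example; the repository ID is a placeholder for this model's actual Hugging Face ID, which is not stated here.

```python
from transformers import pipeline

# Placeholder repo ID: substitute the actual Hugging Face ID of this model.
model_id = "your-org/medroberta-nl-continued"

# RoBERTa-style checkpoints can be queried directly for masked-token prediction.
fill_mask = pipeline("fill-mask", model=model_id)

# Dutch clinical example: "The patient was admitted because of <mask>."
mask = fill_mask.tokenizer.mask_token
print(fill_mask(f"De patiënt werd opgenomen vanwege {mask}."))
```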

# Data statistics

Sources:
* Dutch: medical guidelines (FMS, NHG)
* Dutch: [NtvG](https://www.ntvg.nl/) papers
* English: PubMed abstracts
* English: PMC abstracts translated using DeepL
* English: Apollo guidelines, papers and books
* English: Meditron guidelines
* English: MIMIC-III
* English: MIMIC-CXR
* English: MIMIC-IV

All English sources not already translated with DeepL were translated using a combination of Gemini Flash 1.5 / GPT-4o mini, MarianMT, and NLLB-200.
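
As an illustration of the open-weight part of that translation step, the sketch below shows one way to translate English medical text into Dutch with an NLLB-200 checkpoint via `transformers`; the specific checkpoint and generation settings are assumptions for the example, not the exact configuration used to build this corpus.

```python
from transformers import pipeline

# Example NLLB-200 checkpoint; the distilled 600M variant keeps the sketch lightweight.
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="nld_Latn",
)

abstract = "The patient presented with acute chest pain and shortness of breath."
print(translator(abstract, max_length=256)[0]["translation_text"])
```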

* Number of tokens: 15B
* Number of documents: 27M

# Training

* Effective batch size: 5120
* Learning rate: 2e-4
* Weight decay: 1e-3
* Learning schedule: linear, with 5,000 warmup steps
* Number of epochs: ~3
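
As a rough guide to reproducing a comparable setup, the sketch below maps these hyperparameters onto a `transformers` `TrainingArguments` object; the per-device batch size and gradient-accumulation split (and anything else not listed above) are assumptions chosen only to show how an effective batch size of 5120 could be reached.

```python
from transformers import TrainingArguments

# Hypothetical split of the effective batch size of 5120:
# 64 per device * 8 devices * 10 accumulation steps = 5120.
training_args = TrainingArguments(
    output_dir="medroberta-nl-continued",
    per_device_train_batch_size=64,
    gradient_accumulation_steps=10,
    learning_rate=2e-4,
    weight_decay=1e-3,
    lr_scheduler_type="linear",
    warmup_steps=5_000,
    num_train_epochs=3,
)
```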

* Train perplexity: 2.5
* Validation perplexity: 3.4
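
These values follow the usual definition of perplexity as the exponential of the mean masked-language-modelling cross-entropy loss; a minimal computation, assuming the loss is available as the `eval_loss` value returned by a `transformers` `Trainer.evaluate()` call, looks like this:

```python
import math

# Hypothetical mean cross-entropy (in nats) over masked tokens,
# e.g. the "eval_loss" entry returned by Trainer.evaluate().
eval_loss = 1.22
perplexity = math.exp(eval_loss)  # ~3.39, in line with the validation perplexity above
print(f"perplexity: {perplexity:.2f}")
```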

# Acknowledgement

This work was done together with Amsterdam UMC, in the context of the [DataTools4Heart](https://www.datatools4heart.eu/) project.

We were happy to be able to use the [Google TPU Research Cloud](https://sites.research.google/trc/about/) for training the model.