UMCU committed · Commit b68c505 · verified · 1 Parent(s): f13ce2a

Update README.md

Files changed (1): README.md (+52 −3)

README.md CHANGED
---
license: gpl-3.0
language:
- nl
base_model:
- CLTL/MedRoBERTa.nl
tags:
- medical
- healthcare
metrics:
- perplexity
library_name: transformers
---

Continued, off-premise pre-training of [MedRoBERTa.nl](https://huggingface.co/CLTL/MedRoBERTa.nl) using about 50GB of open Dutch and translated English corpora, followed by on-premise pre-training on 5GB of electronic health records mixed with 2GB of the public data.

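For reference, a minimal usage sketch with the `transformers` library is shown below. The repository id is a placeholder (the actual Hub id of this model is not stated in this card), and the example sentence is illustrative only.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Placeholder repository id; replace with the actual Hub id of this model.
model_id = "UMCU/MedRoBERTa.nl-continued"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Fill-mask on an illustrative Dutch clinical sentence.
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill(f"De patiënt werd opgenomen met {tokenizer.mask_token} op de spoedeisende hulp."))
```
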
# Data statistics

Sources:
* Dutch: medical guidelines (FMS, NHG)
* Dutch: [NtvG](https://www.ntvg.nl/) papers
* English: PubMed abstracts
* English: PMC abstracts translated using DeepL
* English: Apollo guidelines, papers and books
* English: Meditron guidelines
* English: MIMIC-III
* English: MIMIC-CXR
* English: MIMIC-IV

All English sources that were not translated with DeepL were translated with a combination of Gemini Flash 1.5 / GPT-4o mini, MarianNMT, and NLLB-200.

* Number of tokens: 15B
* Number of documents: 27M

# Training

* Effective batch size: 5120
* Learning rate: 2e-4
* Weight decay: 1e-3
* Learning schedule: linear, with 5,000 warmup steps
* Num epochs: ~3

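The card does not state which training stack was used on the TPUs; purely as an illustration, the reported hyperparameters could be expressed with Hugging Face `TrainingArguments` roughly as follows. The per-device batch size, gradient-accumulation steps, and device count are assumptions chosen only so that their product equals the effective batch size of 5120.

```python
from transformers import TrainingArguments

# Illustrative mapping of the reported hyperparameters; the batch-size split is an assumption:
# effective batch size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
#                      = 32 * 20 * 8 = 5120
training_args = TrainingArguments(
    output_dir="medroberta-nl-continued",  # hypothetical output directory
    per_device_train_batch_size=32,
    gradient_accumulation_steps=20,
    learning_rate=2e-4,
    weight_decay=1e-3,
    lr_scheduler_type="linear",
    warmup_steps=5_000,
    num_train_epochs=3,
)
```
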
* Train perplexity: 2.5
* Validation perplexity: 3.4

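The card does not describe how these perplexities were computed; for a masked language model, a common estimate is the exponential of the mean cross-entropy loss over the masked tokens. A minimal sketch (repository id and example sentences are placeholders):

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

model_id = "UMCU/MedRoBERTa.nl-continued"  # placeholder repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id).eval()

# Randomly mask 15% of the tokens, as in standard MLM evaluation.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
texts = ["De patiënt kreeg antibiotica voorgeschreven.",
         "Er is sprake van decompensatio cordis."]
batch = collator([tokenizer(t) for t in texts])

with torch.no_grad():
    loss = model(**batch).loss  # mean cross-entropy over the masked tokens
print("perplexity estimate:", math.exp(loss.item()))
```
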
# Acknowledgement

This work was done together with the Amsterdam UMC, in the context of the [DataTools4Heart](https://www.datatools4heart.eu/) project.

We were happy to be able to use the [Google TPU Research Cloud](https://sites.research.google/trc/about/) for training the model.