---
license: gpl-3.0
language:
- nl
base_model:
- CLTL/MedRoBERTa.nl
tags:
- medical
- healthcare
metrics:
- perplexity
library_name: transformers
---

Continued, off-premise, pre-training of [MedRoBERTa.nl](https://huggingface.co/CLTL/MedRoBERTa.nl) using about 50GB of open Dutch and translated English corpora, followed by on-premise pre-training on 5GB of electronic health records mixed with 2GB of the public corpora.
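
Once published, the model can be used with the `transformers` library like any RoBERTa-style masked language model. The sketch below is a minimal usage example; the repository ID is a placeholder for this model's actual Hugging Face ID, which is not stated here.

```python
from transformers import pipeline

# Placeholder repo ID: substitute the actual Hugging Face ID of this model.
model_id = "your-org/medroberta-nl-continued"

# RoBERTa-style checkpoints can be queried directly for masked-token prediction.
fill_mask = pipeline("fill-mask", model=model_id)

# Dutch clinical example: "The patient was admitted because of <mask>."
mask = fill_mask.tokenizer.mask_token
print(fill_mask(f"De patiënt werd opgenomen vanwege {mask}."))
```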

# Data statistics

Sources:
* Dutch: medical guidelines (FMS, NHG)
* Dutch: [NtvG](https://www.ntvg.nl/) papers
* English: PubMed abstracts
* English: PMC abstracts translated using DeepL
* English: Apollo guidelines, papers and books
* English: Meditron guidelines
* English: MIMIC-III
* English: MIMIC-CXR
* English: MIMIC-IV

All English sources not already translated with DeepL were translated using a combination of Gemini Flash 1.5 / GPT-4o mini, MarianMT, and NLLB-200.
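
As an illustration of the open-weight part of that translation step, the sketch below shows one way to translate English medical text into Dutch with an NLLB-200 checkpoint via `transformers`; the specific checkpoint and generation settings are assumptions for the example, not the exact configuration used to build this corpus.

```python
from transformers import pipeline

# Example NLLB-200 checkpoint; the distilled 600M variant keeps the sketch lightweight.
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="nld_Latn",
)

abstract = "The patient presented with acute chest pain and shortness of breath."
print(translator(abstract, max_length=256)[0]["translation_text"])
```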

* Number of tokens: 15B
* Number of documents: 27M

# Training

* Effective batch size: 5120
* Learning rate: 2e-4
* Weight decay: 1e-3
* Learning schedule: linear, with 5,000 warmup steps
* Number of epochs: ~3
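
As a rough guide to reproducing a comparable setup, the sketch below maps these hyperparameters onto a `transformers` `TrainingArguments` object; the per-device batch size and gradient-accumulation split (and anything else not listed above) are assumptions chosen only to show how an effective batch size of 5120 could be reached.

```python
from transformers import TrainingArguments

# Hypothetical split of the effective batch size of 5120:
# 64 per device * 8 devices * 10 accumulation steps = 5120.
training_args = TrainingArguments(
    output_dir="medroberta-nl-continued",
    per_device_train_batch_size=64,
    gradient_accumulation_steps=10,
    learning_rate=2e-4,
    weight_decay=1e-3,
    lr_scheduler_type="linear",
    warmup_steps=5_000,
    num_train_epochs=3,
)
```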

* Train perplexity: 2.5
* Validation perplexity: 3.4
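
These values follow the usual definition of perplexity as the exponential of the mean masked-language-modelling cross-entropy loss; a minimal computation, assuming the loss is available as the `eval_loss` value returned by a `transformers` `Trainer.evaluate()` call, looks like this:

```python
import math

# Hypothetical mean cross-entropy (in nats) over masked tokens,
# e.g. the "eval_loss" entry returned by Trainer.evaluate().
eval_loss = 1.22
perplexity = math.exp(eval_loss)  # ~3.39, in line with the validation perplexity above
print(f"perplexity: {perplexity:.2f}")
```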

# Acknowledgement

This work was done together with Amsterdam UMC, in the context of the [DataTools4Heart](https://www.datatools4heart.eu/) project.

We were happy to be able to use the [Google TPU Research Cloud](https://sites.research.google/trc/about/) for training the model.