 
jhu-clsp/mmBERT-base • Fill-Mask • 35.4k downloads • 154 likes
mmBERT is trained on 3T tokens covering over 1800 languages, achieving state-of-the-art benchmark scores and exceptional low-resource performance.
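
For reference, a minimal sketch of querying the checkpoint through the Hugging Face `transformers` fill-mask pipeline. The example sentence is illustrative, and the mask token is read from the tokenizer rather than hard-coded, since the exact token depends on the checkpoint's vocabulary:

```python
# Minimal fill-mask sketch for jhu-clsp/mmBERT-base via the
# `transformers` pipeline API; the input sentence is illustrative.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="jhu-clsp/mmBERT-base")

# Read the mask token from the tokenizer instead of hard-coding it.
mask = unmasker.tokenizer.mask_token

# Print the top predictions for the masked position with their scores.
for pred in unmasker(f"Paris is the {mask} of France."):
    print(f"{pred['token_str']!r}: score={pred['score']:.3f}")
```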
 
				 
Notes:
- Intermediate checkpoints for continued pre-training (MosaicML Composer format)
- Pre-training data
- Randomized data for training (not recommended unless you are using the same data mix)