jhu-clsp/mmBERT-base
Fill-Mask
mmBERT is trained on 3T tokens from over 1,800 languages, achieving state-of-the-art benchmark scores and exceptional low-resource performance
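
A minimal usage sketch for the Fill-Mask task listed above, assuming a recent `transformers` release that supports the mmBERT architecture; the example sentence is illustrative only:

```python
from transformers import pipeline

# Masked-token prediction with mmBERT-base (model id from the card above).
unmasker = pipeline("fill-mask", model="jhu-clsp/mmBERT-base")

# Use the tokenizer's own mask token rather than hard-coding it.
mask = unmasker.tokenizer.mask_token

# Print the top predictions and their scores for the masked position.
for pred in unmasker(f"The capital of France is {mask}."):
    print(f"{pred['token_str']!r}: {pred['score']:.3f}")
```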
Note Intermediate checkpoints for continued pre-training (MosaicML Composer format)
Note Pre-training Data
Note Randomized training data (not recommended unless you are using the same data mix)