Dataset Link Fix
Add the (apparently) correct links to P2/P3 datasets in the readme
README.md CHANGED

@@ -90,8 +90,8 @@ mmBERT training data is publicly available across different phases:
 | Phase | Dataset | Tokens | Description |
 |:------|:--------|:-------|:------------|
 | Pre-training P1 | [mmbert-pretrain-p1](https://huggingface.co/datasets/jhu-clsp/mmbert-pretrain-p1-fineweb2-langs) | 2.3T | 60 languages, foundational training |
-| Pre-training P2 | [mmbert-pretrain-p2](https://huggingface.co/datasets/jhu-clsp/
-| Pre-training P3 | [mmbert-pretrain-p3](https://huggingface.co/datasets/jhu-clsp/
+| Pre-training P2 | [mmbert-pretrain-p2](https://huggingface.co/datasets/jhu-clsp/mmBERT-pretrain-p2-fineweb2-remaining) | - | Extension data for pre-training phase |
+| Pre-training P3 | [mmbert-pretrain-p3](https://huggingface.co/datasets/jhu-clsp/mmBERT-pretrain-p3-others) | - | Final pre-training data |
 | Mid-training | [mmbert-midtraining](https://huggingface.co/datasets/jhu-clsp/mmbert-midtraining-data) | 600B | 110 languages, context extension to 8K |
 | Decay Phase | [mmbert-decay](https://huggingface.co/datasets/jhu-clsp/mmbert-decay-data) | 100B | 1833 languages, premium quality |
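For anyone who wants to sanity-check the corrected P2/P3 links, here is a minimal sketch using the standard Hugging Face `datasets` streaming API; the `train` split name and the record schema are assumptions on my part and are not confirmed by this commit.

```python
from datasets import load_dataset

# Stream the P2 pre-training corpus instead of downloading it in full.
# "jhu-clsp/mmBERT-pretrain-p2-fineweb2-remaining" is the dataset id
# linked in the table above; the "train" split name is an assumption.
ds = load_dataset(
    "jhu-clsp/mmBERT-pretrain-p2-fineweb2-remaining",
    split="train",
    streaming=True,
)

# Pull the first record to confirm the link resolves and to inspect the schema.
print(next(iter(ds)))
```

The same pattern should work for the P3 dataset by swapping in `jhu-clsp/mmBERT-pretrain-p3-others`.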