update readme

README.md CHANGED

@@ -14,13 +14,6 @@ Given the current market price of H100 GPU hours, training the model only costs
 To our surprise, JetMoE-8B performs even better than LLaMA2-7B, LLaMA-13B, and DeepseekMoE-16B despite the lower training cost and computation.
 Compared to Gemma-2B, which requires similar training and inference computation, JetMoE-8B achieves significantly better performance.

-<figure>
-<center>
-<img src="images/jetmoe_architecture.png" width="40%">
-<figcaption>JetMoE Architecture</figcaption>
-</center>
-</figure>
-
 ## Evaluation Results
 |Model|Active Params|Training Tokens|ARC-challenge|Hellaswag|MMLU|TruthfulQA|WinoGrande|GSM8k|Open LLM Leaderboard Average|MBPP|HumanEval|
 |---|---|---|---|---|---|---|---|---|---|---|---|
@@ -57,6 +50,13 @@ Each MoA and MoE layer has 8 experts, and 2 experts are activated for each input
 It has 8 billion parameters in total and 2.2B active parameters.
 JetMoE-8B is trained on 1.25T tokens from publicly available datasets, with a learning rate of 5.0 x 10<sup>-4</sup> and a global batch size of 4M tokens.

+<figure>
+<center>
+<img src="images/jetmoe_architecture.png" width="40%">
+<figcaption>JetMoE Architecture</figcaption>
+</center>
+</figure>
+
 **Input** Models input text only.

 **Output** Models generate text only.
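The hunk above notes that each MoA and MoE layer has 8 experts, of which 2 are activated for each input token. As a rough illustration of that top-2-of-8 routing pattern, here is a minimal PyTorch sketch; the hidden size and the dense experts are hypothetical stand-ins, and this is not the JetMoE implementation itself.

```python
import torch
import torch.nn.functional as F

# Sketch of top-2-of-8 expert routing, the pattern described in the README
# (8 experts per MoE/MoA layer, 2 activated per input token).
# The hidden size and the dense experts below are hypothetical stand-ins.
num_experts = 8
top_k = 2
hidden_size = 2048

router = torch.nn.Linear(hidden_size, num_experts, bias=False)
experts = torch.nn.ModuleList(
    [torch.nn.Linear(hidden_size, hidden_size) for _ in range(num_experts)]
)

def moe_forward(x: torch.Tensor) -> torch.Tensor:
    """Route each token to its top-2 experts and mix their outputs.

    x: (num_tokens, hidden_size) -> (num_tokens, hidden_size)
    """
    logits = router(x)                                    # (tokens, num_experts)
    weights, indices = torch.topk(logits, top_k, dim=-1)  # pick 2 experts per token
    weights = F.softmax(weights, dim=-1)                  # normalize over the chosen 2
    out = torch.zeros_like(x)
    for slot in range(top_k):
        for e in range(num_experts):
            mask = indices[:, slot] == e                  # tokens whose slot-th pick is expert e
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * experts[e](x[mask])
    return out

tokens = torch.randn(4, hidden_size)
print(moe_forward(tokens).shape)  # torch.Size([4, 2048])
```

At the stated 4M-token global batch size, the 1.25T training tokens correspond to roughly 312,500 optimizer steps.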