Update README.md
README.md CHANGED
```diff
@@ -40,18 +40,18 @@ Zephyr is a series of language models that are trained to act as helpful assista
 
 ## Performance
 
-
+At the time of release, Zephyr-7B-β is the highest ranked 7B chat model on the [MT-Bench](https://huggingface.co/spaces/lmsys/mt-bench) and [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/) benchmarks:
 
-| Model | Size |
+| Model | Size | Alignment | MT-Bench (score) | AlpacaEval (win rate %) |
 |-------------|-----|----|---------------|--------------|
 | StableLM-Tuned-α | 7B| dSFT |2.75| -|
 | MPT-Chat | 7B |dSFT |5.42| -|
 | Xwin-LMv0.1 | 7B| dPPO| 6.19| 87.83|
 | Mistral-Instructv0.1 | 7B| - | 6.84 |-|
 | Zephyr-7b-α |7B| dDPO| 6.88| -|
-| **Zephyr-7b-β** |7B|
+| **Zephyr-7b-β** 🪁 | **7B** | **dDPO** | **7.34** | **90.60** |
 | Falcon-Instruct | 40B |dSFT |5.17 |45.71|
-| Guanaco 65B | SFT |6.41| 71.80|
+| Guanaco | 65B | SFT |6.41| 71.80|
 | Llama2-Chat | 70B |RLHF |6.86| 92.66|
 | Vicuna v1.3 | 33B |dSFT |7.12 |88.99|
 | WizardLM v1.0 | 70B |dSFT |7.71 |-|
@@ -60,6 +60,13 @@ Zephyr is a series of language models that are trained to act as helpful assista
 | Claude 2 | - |RLHF |8.06| 91.36|
 | GPT-4 | -| RLHF |8.99| 95.28|
 
+In particular, on several categories of MT-Bench, Zephyr-7B-β has strong performance compared to larger open models like Llama2-Chat-70B:
+
+![MT-Bench per-category comparison](…)
+
+However, on more complex tasks like coding and mathematics, Zephyr-7B-β lags behind proprietary models and more research is needed to close the gap.
+
+
 ## Intended uses & limitations
 
 The model was initially fine-tuned on a filtered and preprocessed of the [`UltraChat`](https://huggingface.co/datasets/stingning/ultrachat) dataset, which contains a diverse range of synthetic dialogues generated by ChatGPT.
@@ -108,9 +115,8 @@ It is also unknown what the size and composition of the corpus was used to train
 
 ## Training and evaluation data
 
-
+During DPO training, this model achieves the following results on the evaluation set:
 
-It achieves the following results on the evaluation set:
 - Loss: 0.7496
 - Rewards/chosen: -4.5221
 - Rewards/rejected: -8.3184
@@ -140,6 +146,9 @@ The following hyperparameters were used during training:
 
 ### Training results
 
+The table below shows the full set of DPO training metrics:
+
+
 | Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
 |:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
 | 0.6284 | 0.05 | 100 | 0.6098 | 0.0425 | -0.1872 | 0.7344 | 0.2297 | -258.8416 | -253.8099 | -2.7976 | -2.8234 |
```
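For readers unfamiliar with the DPO quantities reported in the diff above (Rewards/chosen, Rewards/rejected, Rewards/accuracies, Rewards/margins, Logps/*), the sketch below shows one common way such metrics are derived from summed log-probabilities of the chosen and rejected completions under the policy and a frozen reference model. This is a minimal illustration under assumed conventions, not the training code used for Zephyr-7B-β; the function name, the β value of 0.1, and the dummy inputs are all assumptions.

```python
# Minimal sketch of DPO-style metric computation. Illustrative only:
# beta=0.1 and all names/values here are assumptions, not the exact
# training setup behind the numbers reported above.
import torch
import torch.nn.functional as F

def dpo_metrics(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit DPO rewards: beta * (policy log-prob - reference log-prob)
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # DPO loss: negative log-sigmoid of the reward margin
    margins = chosen_rewards - rejected_rewards
    loss = -F.logsigmoid(margins).mean()

    # Batch-mean metrics, mirroring the column names in the table above
    return {
        "loss": loss.item(),
        "rewards/chosen": chosen_rewards.mean().item(),
        "rewards/rejected": rejected_rewards.mean().item(),
        "rewards/accuracies": (margins > 0).float().mean().item(),
        "rewards/margins": margins.mean().item(),
        "logps/chosen": policy_chosen_logps.mean().item(),
        "logps/rejected": policy_rejected_logps.mean().item(),
    }

# Dummy per-sequence log-probabilities for a batch of 4 preference pairs
policy_chosen = torch.tensor([-253.8, -250.1, -260.4, -249.9])
policy_rejected = torch.tensor([-258.8, -262.3, -270.1, -261.0])
ref_chosen = torch.tensor([-254.2, -250.8, -260.0, -250.3])
ref_rejected = torch.tensor([-257.0, -260.9, -268.2, -259.5])
print(dpo_metrics(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

Under this convention, Rewards/accuracies is simply the fraction of preference pairs for which the chosen completion receives a higher implicit reward than the rejected one.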