# zephyr-7b-dpo-full-prometheus-high-curriculum
This model is a fine-tuned version of alignment-handbook/zephyr-7b-sft-full; the training dataset is not specified in this card. It achieves the following results on the evaluation set (a brief consistency check on the reward metrics follows the list):
- Loss: 0.4919
- Rewards/chosen: -1.0893
- Rewards/rejected: -1.9611
- Rewards/accuracies: 0.7457
- Rewards/margins: 0.8717
- Logps/rejected: -444.3828
- Logps/chosen: -368.8934
- Logits/rejected: 2.6386
- Logits/chosen: 2.0277
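As a quick sanity check (assuming the standard TRL DPO conventions, where the chosen/rejected rewards are beta-scaled log-probability differences against the reference model), the reported margin should simply be the gap between the chosen and rejected rewards:

```python
# Consistency check on the reported metrics (values copied from the list above):
# the reward margin should equal rewards/chosen - rewards/rejected up to rounding.
rewards_chosen = -1.0893
rewards_rejected = -1.9611
print(round(rewards_chosen - rewards_rejected, 4))  # 0.8718 vs. the reported 0.8717
```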
## Model description
More information needed
## Intended uses & limitations
More information needed
## Training and evaluation data
More information needed
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training (see the configuration sketch after the list):
- learning_rate: 5e-07
- train_batch_size: 8
- eval_batch_size: 8
- seed: 55
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 128
- total_eval_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
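For readers who want to reproduce this setup, the sketch below shows how these hyperparameters could be wired into TRL's `DPOTrainer`, which the alignment-handbook recipes commonly use. The dataset name, output directory, and `beta` value are placeholders rather than values taken from this card, and argument names may differ slightly across trl releases.

```python
# Hedged sketch: passing the hyperparameters above to TRL's DPOTrainer.
# The dataset, output_dir, and beta are placeholders, not values from this card.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "alignment-handbook/zephyr-7b-sft-full"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Hypothetical preference dataset with "prompt"/"chosen"/"rejected" columns.
dataset = load_dataset("your-org/your-preference-dataset")

args = DPOConfig(
    output_dir="zephyr-7b-dpo-full",   # placeholder
    learning_rate=5e-7,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,     # 8 GPUs x 8 per device x 2 accumulation = 128 effective
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=55,
    bf16=True,
    beta=0.1,                          # not reported in this card; TRL's default is used here
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,                    # with ref_model=None, TRL clones the policy as the reference
    args=args,
    train_dataset=dataset["train"],
    tokenizer=tokenizer,
)
trainer.train()
```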
### Training results
| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6482 | 0.1143 | 50 | 0.6407 | -0.1062 | -0.2248 | 0.6853 | 0.1186 | -270.7546 | -270.5768 | -2.5123 | -2.5571 |
| 0.5804 | 0.2286 | 100 | 0.5786 | -0.5277 | -0.8872 | 0.6724 | 0.3595 | -336.9930 | -312.7311 | -2.4375 | -2.4881 |
| 0.5484 | 0.3429 | 150 | 0.5353 | -0.9089 | -1.5948 | 0.7241 | 0.6859 | -407.7577 | -350.8555 | 0.5418 | 0.1584 |
| 0.5259 | 0.4571 | 200 | 0.5121 | -1.0561 | -1.8604 | 0.7371 | 0.8043 | -434.3204 | -365.5715 | 1.3888 | 0.9506 |
| 0.5137 | 0.5714 | 250 | 0.5031 | -0.8771 | -1.6989 | 0.7155 | 0.8218 | -418.1702 | -347.6707 | 1.5929 | 1.0452 |
| 0.5034 | 0.6857 | 300 | 0.4974 | -1.1806 | -1.9828 | 0.7241 | 0.8022 | -446.5578 | -378.0252 | 2.7550 | 2.1828 |
| 0.498 | 0.8 | 350 | 0.4931 | -1.1358 | -1.9863 | 0.7414 | 0.8505 | -446.9080 | -373.5433 | 2.6879 | 2.0881 |
| 0.4898 | 0.9143 | 400 | 0.4919 | -1.0893 | -1.9611 | 0.7457 | 0.8717 | -444.3828 | -368.8934 | 2.6386 | 2.0277 |
### Framework versions
- Transformers 4.44.0.dev0
- Pytorch 2.1.2
- Datasets 2.20.0
- Tokenizers 0.19.1
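A minimal inference sketch, assuming the repository id listed in the model tree below and a Zephyr-style chat template inherited from the SFT base model:

```python
# Minimal inference sketch (the repo id matches the model tree below; the chat
# template is assumed to be inherited from the Zephyr SFT base model).
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="sfulay/zephyr-7b-dpo-full-prometheus-high-curriculum",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain DPO in one sentence."},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
out = pipe(prompt, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.95)
print(out[0]["generated_text"])
```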
## Model tree for sfulay/zephyr-7b-dpo-full-prometheus-high-curriculum

- Base model: mistralai/Mistral-7B-v0.1
- Fine-tuned from: alignment-handbook/zephyr-7b-sft-full