# Self-Play Preference Optimization (SPPO)
Paper: [Self-Play Preference Optimization for Language Model Alignment](https://arxiv.org/abs/2405.00675)
This model was developed with Self-Play Preference Optimization at iteration 3, starting from google/gemma-2-9b-it. We used the prompt sets from the openbmb/UltraFeedback dataset, split into three parts for the three iterations following snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset. All responses used in training are synthetic.
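As a rough illustration of the objective behind this recipe (a sketch based on the cited paper, not the authors' released training code), SPPO fits the policy's log-probability ratio against the previous-iteration policy to a scaled, centered estimate of each response's win probability. The function name, tensor layout, and the `eta` value below are assumptions made for the example:

```python
import torch

def sppo_loss(logp_theta: torch.Tensor,
              logp_prev: torch.Tensor,
              win_prob: torch.Tensor,
              eta: float = 1000.0) -> torch.Tensor:
    """Squared-loss form of the SPPO objective (illustrative sketch).

    logp_theta: log pi_theta(y|x) for each sampled response, shape (batch,)
    logp_prev:  log pi_t(y|x) under the previous-iteration policy, shape (batch,)
    win_prob:   estimated probability that y beats the current policy's own
                responses (e.g. scored by a pairwise preference model), shape (batch,)
    eta:        scaling hyperparameter; the default here is only a placeholder.
    """
    log_ratio = logp_theta - logp_prev
    target = eta * (win_prob - 0.5)  # centered, scaled preference signal
    return ((log_ratio - target) ** 2).mean()
```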
Terms of Use: Terms
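For reference, a minimal generation sketch with 🤗 Transformers using the standard Gemma-2 chat-template workflow; the repo ID below is a placeholder rather than the published checkpoint name, so substitute the actual model ID:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<org>/Gemma-2-9B-SPPO-Iter3"  # placeholder repo ID; replace with the actual one

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [
    {"role": "user", "content": "Summarize self-play preference optimization in two sentences."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```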
| Model | LC (Length-Controlled) Win Rate (%) | Win Rate (%) | Avg. Length |
|---|---|---|---|
| Gemma-2-9B-SPPO Iter1 | 48.70 | 40.76 | 1669 | 
| Gemma-2-9B-SPPO Iter2 | 50.93 | 44.64 | 1759 | 
| Gemma-2-9B-SPPO Iter3 | 53.27 | 47.74 | 1803 | 
The following hyperparameters were used during training:
@misc{wu2024self,
      title={Self-Play Preference Optimization for Language Model Alignment}, 
      author={Wu, Yue and Sun, Zhiqing and Yuan, Huizhuo and Ji, Kaixuan and Yang, Yiming and Gu, Quanquan},
      year={2024},
      eprint={2405.00675},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}