# OPA-DPO LoRA for LLaVA-v1.5-7B

## Introduction

Hallucination remains a major challenge for Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) has gained increasing attention as a simple remedy for hallucination. Nonetheless, the different data-construction methods used in existing works lead to notable performance variations. We identify a crucial factor: outcomes largely depend on whether the constructed data are on-policy with respect to the initial (reference) policy of DPO. Because of the implicit KL-divergence constraint, off-policy data cannot be learned effectively. We propose the On-Policy Alignment (OPA)-DPO framework, which uniquely leverages expert feedback to correct hallucinated responses and aligns both the original and expert-revised responses in an on-policy manner. Compared with DPO without the OPA operation, OPA-DPO significantly improves performance: it achieves state-of-the-art results with only 4.8k training samples, whereas most DPO-based algorithms require more than 10k.

## Usage

Please refer to our [GitHub repository](https://github.com/zhyang2226/OPA-DPO) for more details. If you wish to use this model outside of our code, make sure to set `base_model_name_or_path` in the `adapter_config.json` file to `liuhaotian/llava-v1.5-7b`. Please note that the LoRA modules are also added on top of the vision tower. **Ensure that the vision tower is loaded before loading the LoRA module** (a loading sketch is provided at the end of this card).

## Acknowledgements

We would like to express our gratitude for the code snippets provided in [LLaVA](https://github.com/haotian-liu/LLaVA), [LLaVA-RLHF](https://github.com/llava-rlhf/LLaVA-RLHF), [FastChat](https://github.com/lm-sys/FastChat), and [TRL](https://github.com/huggingface/trl), and for the dataset provided in [RLAIF-V](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset). These resources have significantly contributed to the development of our project.
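
## Loading Example

Referenced from the Usage section above, the snippet below is a minimal sketch of how the adapter could be loaded outside our training code. It is not taken from the repository: it assumes the original LLaVA codebase (the `llava` package) and `peft` are installed, and the adapter path is a hypothetical placeholder. Please verify the exact class and method names against the GitHub repository before use.

```python
# Minimal loading sketch (assumptions: llava + peft installed; adapter path is hypothetical).
import torch
from transformers import AutoTokenizer
from peft import PeftModel
from llava.model import LlavaLlamaForCausalLM

BASE_MODEL = "liuhaotian/llava-v1.5-7b"
ADAPTER_PATH = "path/to/opa-dpo-lora"  # hypothetical local path to this adapter

# 1. Load the base LLaVA-v1.5-7B model and its tokenizer.
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, use_fast=False)
model = LlavaLlamaForCausalLM.from_pretrained(
    BASE_MODEL, torch_dtype=torch.float16, device_map="auto"
)

# 2. Load the vision tower FIRST: the LoRA adapter also targets vision-tower
#    modules, so their weights must exist before the adapter is attached.
vision_tower = model.get_vision_tower()
if not vision_tower.is_loaded:
    vision_tower.load_model()
vision_tower.to(dtype=torch.float16)

# 3. Attach the OPA-DPO LoRA adapter (its adapter_config.json should point
#    base_model_name_or_path at liuhaotian/llava-v1.5-7b).
model = PeftModel.from_pretrained(model, ADAPTER_PATH)
model.eval()
```

The key design point, as noted in the Usage section, is the ordering: because LoRA modules are attached to the vision tower as well, the tower's weights must be materialized before `PeftModel.from_pretrained` injects the adapter.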