---
license: apache-2.0
datasets:
- TIGER-Lab/MMEB-train
language:
- en
base_model:
- llava-hf/llava-onevision-qwen2-7b-ov-hf
library_name: transformers
tags:
- Retrieval
- Multimodal
- Embedding
pipeline_tag: image-text-to-text
---

<div align="center">

<h1>UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning</h1>

<a href="https://scholar.google.com/citations?hl=zh-CN&user=9etrpbYAAAAJ">Tiancheng Gu*</a>,
<a href="https://kaicheng-yang0828.github.io">Kaicheng Yang*</a>,
<a href="https://kcz358.github.io/">Kaichen Zhang</a>,
<a href="https://scholar.google.com/citations?hl=zh-CN&user=1ckaPgwAAAAJ">Xiang An</a>,
Ziyong Feng, \
<a href="https://scholar.google.com/citations?hl=en&user=LatWlFAAAAAJ">Yueyi Zhang</a>,
<a href="https://weidong-tom-cai.github.io">Weidong Cai</a>,
<a href="https://jiankangdeng.github.io">Jiankang Deng</a>,
<a href="https://lidongbing.github.io">Lidong Bing</a>

[![Project Website](https://img.shields.io/badge/🏡-Project%20Website-deepgray)](https://garygutc.github.io/UniME-v2/)
[![Paper](https://img.shields.io/badge/📄-Paper-b31b1b.svg)](https://arxiv.org/abs/2510.13515)
[![GitHub](https://img.shields.io/badge/⭐-GitHub-black?logo=github)](https://github.com/GaryGuTC/UniME-v2)
</div>

## 💡 Highlights
- We introduce an MLLM-as-a-Judge pipeline for hard negative mining. It uses the strong understanding capabilities of MLLMs to assess the semantic alignment of each query-candidate pair within a globally retrieved set of potential hard negatives.

<div align="center">
  <img src="Figures/method1.jpg" width="95%">
</div>
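
The mining step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `retrieve_top_k`, `mine_hard_negatives`, the `judge` callable, and the `false_negative_threshold` value are all hypothetical, and the judge here is a stand-in for an MLLM that returns a semantic matching score per candidate.

```python
from typing import Callable, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_top_k(
    query_emb: Sequence[float],
    cand_embs: Sequence[Sequence[float]],
    positive_idx: int,
    k: int,
) -> List[int]:
    """Return the k candidates most similar to the query, skipping the labeled positive."""
    scored = [
        (dot(query_emb, emb), idx)
        for idx, emb in enumerate(cand_embs)
        if idx != positive_idx
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


def mine_hard_negatives(
    query_emb: Sequence[float],
    cand_embs: Sequence[Sequence[float]],
    positive_idx: int,
    judge: Callable[[int], float],
    k: int = 3,
    false_negative_threshold: float = 0.9,  # assumed value, for illustration only
) -> List[Tuple[int, float]]:
    """Score the retrieved pool with the judge and drop likely false negatives."""
    pool = retrieve_top_k(query_emb, cand_embs, positive_idx, k)
    judged = [(idx, judge(idx)) for idx in pool]
    # A very high judge score suggests the candidate actually matches the query.
    return [(idx, s) for idx, s in judged if s < false_negative_threshold]


# Toy data: candidate 0 is the labeled positive.
query = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.7, 0.3]]
toy_judge_scores = {1: 0.95, 2: 0.6, 3: 0.1, 4: 0.3}  # stand-in for MLLM output

hard_negatives = mine_hard_negatives(query, candidates, 0, toy_judge_scores.__getitem__)
print(hard_negatives)
```

In this toy run, candidate 1 is retrieved but discarded because its judge score exceeds the assumed threshold, while candidates 2 and 4 are kept together with their scores for later use as soft labels.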

- We present UniME-V2, a universal multimodal embedding model trained with a distribution alignment framework based on MLLM judgments. Using the semantic matching scores as soft labels lets the model capture semantic differences between candidates, which improves its discriminative capability. We also propose UniME-V2-Reranker, a reranking model trained on high-quality, diverse hard negatives with joint pairwise and listwise optimization.

<div align="center">
  <img src="Figures/method2.jpg" width="60%">
</div>
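
To make the soft-label idea concrete, the sketch below turns judge scores and embedding similarities into two distributions over the same candidates and measures how far apart they are. It is an illustration only: KL divergence, the temperature value, and the function names are assumptions chosen for the example, and the actual training objective may use a different divergence or scaling.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target: Sequence[float], predicted: Sequence[float]) -> float:
    """KL(target || predicted) for two distributions over the same candidates."""
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0.0)


def alignment_loss(
    judge_scores: Sequence[float],
    embedding_sims: Sequence[float],
    temperature: float = 0.05,  # assumed value, for illustration only
) -> float:
    """Penalize the embedding model when its similarity distribution
    disagrees with the distribution implied by the judge's soft labels."""
    target = softmax(judge_scores, temperature)
    predicted = softmax(embedding_sims, temperature)
    return kl_divergence(target, predicted)


judge = [0.9, 0.4, 0.1]          # toy semantic matching scores (soft labels)
aligned_sims = [0.8, 0.5, 0.2]   # embedding similarities with the same ordering
reversed_sims = [0.2, 0.5, 0.8]  # embedding similarities with the opposite ordering

print(alignment_loss(judge, aligned_sims))
print(alignment_loss(judge, reversed_sims))
```

The loss is small when the embedding similarities rank candidates the same way the judge does and grows when the ordering is reversed, which is the behavior a soft-label alignment objective is meant to encourage.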

## πŸ› οΈ Implementation

## πŸš€ Quick Start
```bash
git clone https://github.com/deepglint/UniME-v2.git
cd UniME-v2
```

```bash
conda create -n uniMEv2 python=3.10 -y
conda activate uniMEv2
pip install -r requirements.txt

# Optional: Install Flash Attention for acceleration
# wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
# pip install flash_attn-2.7.4.post1+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```
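
If the optional Flash Attention wheel is installed, Transformers models can generally be loaded with `attn_implementation="flash_attention_2"`. The sketch below picks a value at runtime. `pick_attn_implementation` is a hypothetical helper, and this README does not state whether `init_model_and_processor` accepts such an option, so treat it as an assumption to check against the repository.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return an attention backend name based on what is installed."""
    # find_spec returns None when the package is not importable.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    # Fall back to PyTorch's scaled-dot-product attention.
    return "sdpa"


print(pick_attn_implementation())
```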

### πŸ” Embedding model & Rerank model
```python
import torch
from torch.nn import functional as F
from utils.utils import init_model_and_processor, prepare_stage_data, parse_answer_index

device = "cuda"
embedding = False  # True: run the embedding model; False: run the rerank model
if embedding:
    model_name="models/UniME-V2_qwen2VL_2B"
    # model_name="models/UniME-V2_qwen2VL_7B"
    # model_name="models/UniME-V2_LLaVA_onevision_8B"
    text = "A man is crossing the street with a red car parked nearby."
    image_path = "Figures/demo.png"
else:
    model_name="models/UniME-v2-rerank_qwen25VL_7B"
    text = ["A man is crossing the street with a red car parked nearby.",  #! Target text
            "A woman is walking her dog with a blue bicycle leaning nearby.",
            "A child is riding a scooter past a green truck stopped nearby.",
            "A couple is waiting for the bus beside a yellow taxi parked nearby.",
            "A jogger is running along the path with a black motorcycle parked nearby."]
    image_path = "Figures/demo.png"

model, processor = init_model_and_processor(model_name, device, embedding=embedding)

if embedding:
    inputs_image, inputs_txt = prepare_stage_data(model_name, processor, text, image_path, embedding=embedding)
    inputs_image = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs_image.items()}
    inputs_txt = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs_txt.items()}
    with torch.no_grad():
        # Use the last token's final-layer hidden state as the embedding.
        emb_text = model(**inputs_txt, output_hidden_states=True, return_dict=True).hidden_states[-1][:, -1, :]
        emb_image = model(**inputs_image, output_hidden_states=True, return_dict=True).hidden_states[-1][:, -1, :]
        # L2-normalize so the dot product below is a cosine similarity.
        emb_text = F.normalize(emb_text, dim=-1)
        emb_image = F.normalize(emb_image, dim=-1)
        score = emb_image @ emb_text.T
        print("Score: ", score.item()) # qwen2VL 2B : Score: 0.62109375
else:
    inputs = prepare_stage_data(model_name, processor, text, image_path, embedding=embedding)
    inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_new_tokens=128, output_scores=True, return_dict_in_generate=True, do_sample=False).sequences
    generated_ids_trimmed = [
        out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs['input_ids'], generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    print("Rerank Answer: ", parse_answer_index(output_text[0])) # qwen25VL 7B: Rerank Answer: 0
```
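
The embedding branch above scores a single image-text pair. When several candidates are embedded, the same cosine similarity can be used to order them. The following is a minimal sketch with plain Python lists standing in for embedding tensors; `rank_candidates` and the toy vectors are hypothetical and not part of this repository.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def rank_candidates(
    query_emb: Sequence[float],
    cand_embs: Sequence[Sequence[float]],
) -> List[Tuple[int, float]]:
    """Return (candidate index, cosine similarity) pairs, best match first."""
    q = l2_normalize(query_emb)
    scored = []
    for idx, emb in enumerate(cand_embs):
        c = l2_normalize(emb)
        scored.append((idx, sum(a * b for a, b in zip(q, c))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


# Toy vectors standing in for one image embedding and three text embeddings.
image_emb = [1.0, 0.0]
text_embs = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]

ranking = rank_candidates(image_emb, text_embs)
print([idx for idx, _ in ranking])
```

A typical two-stage setup would pass the top few indices from such a ranking to the rerank model, which then selects the final answer.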

## 📊 Results

### 🌈 Diversity Retrieval
<div align="center">
  <img src="Figures/UniME_v2_diversity_retrieval.png" width="90%">
</div>


### πŸ† MMEB
<div align="center">
  <img src="Figures/UniME_v2_MMEB.png" width="90%">
</div>

## 💬 Support
| Team Member | Email |
|-------------|-------|
| **Tiancheng Gu** | [![Email](https://img.shields.io/badge/📧[email protected]?logo=gmail)](mailto:[email protected]) |
| **Kaicheng Yang** | [![Email](https://img.shields.io/badge/📧[email protected]?logo=gmail)](mailto:[email protected]) |


## πŸ–ŠοΈ Citation
If you find this repository useful, please use the following BibTeX entry for citation.
```latex
@misc{gu2025unimev2mllmasajudgeuniversalmultimodal,
      title={UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning}, 
      author={Tiancheng Gu and Kaicheng Yang and Kaichen Zhang and Xiang An and Ziyong Feng and Yueyi Zhang and Weidong Cai and Jiankang Deng and Lidong Bing},
      year={2025},
      eprint={2510.13515},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.13515}, 
}

@inproceedings{unime,
      title={Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs},
      author={Gu, Tiancheng and Yang, Kaicheng and Feng, Ziyong and Wang, Xingjun and Zhang, Yanzhao and Long, Dingkun and Chen, Yingda and Cai, Weidong and Deng, Jiankang},
      booktitle={ACM MM},
      year={2025}
}

```

<div align="center">
⭐ Don't forget to star this repository if you find it helpful!

</div>