We introduce the Generative Universal Verifier, a novel concept and plugin for next-generation multimodal reasoning in vision-language models and unified multimodal models. It provides the fundamental capability to reflect on and refine visual outcomes during reasoning and generation.
- ViVerBench: a comprehensive benchmark spanning 16 categories of critical tasks for evaluating visual outcomes in multimodal reasoning.
- OmniVerifier-7B: the first omni-capable generative verifier for universal visual verification, trained on large-scale visual verification data; it achieves notable gains on ViVerBench (+8.3).
- OmniVerifier-TTS: a sequential test-time scaling paradigm that leverages the universal verifier to bridge image generation and editing within unified models, raising the upper bound of generative ability through iterative fine-grained optimization (sketched below).
OmniVerifier advances both reliable reflection during generation and scalable test-time refinement, marking a step toward more trustworthy and controllable next-generation reasoning systems.
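The OmniVerifier-TTS loop alternates generation, verification, and editing. The sketch below is illustrative, not the released implementation: `generate_image` and `edit_image` are hypothetical stand-ins for a unified model's generation and editing interfaces, and `verify` is assumed to wrap the OmniVerifier-7B call from the Quick Start below, returning the parsed answer and explanation.

```python
# Illustrative sketch of the OmniVerifier-TTS loop; all helpers are hypothetical.
def omniverifier_tts(prompt, max_rounds=4):
    image = generate_image(prompt)                    # initial generation (hypothetical helper)
    for _ in range(max_rounds):
        aligned, explanation = verify(image, prompt)  # OmniVerifier-7B judgment (hypothetical wrapper)
        if aligned:
            break                                     # image matches the prompt; stop refining
        # Feed the verifier's explanation back as a fine-grained editing instruction.
        image = edit_image(image, explanation)        # hypothetical unified-model edit
    return image
```

Each round narrows the gap between prompt and image, which is how the verifier raises the generative upper bound at test time.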
Quick Start: Generated Image Verification
Use the following code to test OmniVerifier-7B on a generated image:
Please replace `image_path` and `prompt` with your own settings.
The model will output both an answer and an explanation, indicating whether the image is strictly aligned with the given prompt.
```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
"comin/OmniVerifier-7B", torch_dtype=torch.bfloat16, device_map="auto"
)
# default processor
processor = AutoProcessor.from_pretrained("comin/OmniVerifier-7B")
image_path = '' # please replace it with your own image path
prompt = '' # please replace it with the prompt you use to generate the image
question = f"""This image was generated from the prompt: {prompt}.
Please carefully analyze the image and determine whether all the objects, attributes, and spatial relationships mentioned in the prompt are correctly represented in the image.
If the image accurately reflects the prompt, please answer 'true'; otherwise, answer 'false'.
Respond strictly in the following JSON format: """ + """
{
"answer": true/false,
"explanation": "If the answer is false, briefly summarize the main error.",
}
"""
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": image_path,
},
{"type": "text", "text": question},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to(model.device)  # follow the device chosen by device_map="auto"
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
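# Trim the prompt tokens so that only the newly generated answer is decoded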
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
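Since the model is instructed to respond in JSON, the decoded text can be parsed programmatically. The snippet below is a minimal sketch; the fence-stripping step is an assumption, in case the model wraps its JSON in markdown code fences:

```python
import json

raw = output_text[0].strip()
# Strip markdown code fences if the model added them (assumption; adjust as needed).
if raw.startswith("```"):
    raw = raw.strip("`").removeprefix("json").strip()

result = json.loads(raw)
print("Aligned with prompt:", result["answer"])
if not result["answer"]:
    print("Main error:", result["explanation"])
```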
Citation

```bibtex
@article{zhang2025generative,
  author  = {Zhang, Xinchen and Zhang, Xiaoying and Wu, Youbin and Cao, Yanbin and Zhang, Renrui and Chu, Ruihang and Yang, Ling and Yang, Yujiu},
  title   = {Generative Universal Verifier as Multimodal Meta-Reasoner},
  journal = {arXiv preprint arXiv:2510.13804},
  year    = {2025}
}
```