MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query


Model Details

We introduce CORAL, a multimodal embedding model built upon Qwen2.5-VL-3B-Instruct. CORAL supports interleaved multi-condition semantic retrieval queries and was trained on MERIT, a new dataset introduced in our paper, MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query.

CORAL is short for Contrastive Reconstruction for Multimodal Retrieval. The loss function of CORAL consists of three components: Contrastive Learning Loss, Vision Reconstruction Loss, and Masked Language Modeling Loss. During training, we reconstruct both the query and its corresponding positive sample.
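
As a rough sketch, the overall training objective can be written as a weighted sum of the three terms; the weights $\lambda_1$ and $\lambda_2$ below are illustrative placeholders, not values taken from the paper:

$$\mathcal{L} = \mathcal{L}_{\mathrm{con}} + \lambda_1 \, \mathcal{L}_{\mathrm{rec}} + \lambda_2 \, \mathcal{L}_{\mathrm{mlm}}$$

where $\mathcal{L}_{\mathrm{con}}$ is the contrastive learning loss, $\mathcal{L}_{\mathrm{rec}}$ the vision reconstruction loss, and $\mathcal{L}_{\mathrm{mlm}}$ the masked language modeling loss, the latter two computed while reconstructing the query and its positive sample.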

Figure: Overview of CORAL.

Example

Figure: Example query and ground truth.

Usage

Transformers

We provide the CORAL checkpoint on Hugging Face. You can load the model with the following code; the qwen_vl_utils helper package is also required for vision preprocessing:

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info


# Initialize Model and Processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Bia/CORAL", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("Bia/CORAL")
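
# Optional: as with other Qwen2.5-VL checkpoints, min_pixels / max_pixels can be passed to
# AutoProcessor.from_pretrained to bound the number of visual tokens per image, e.g.
# AutoProcessor.from_pretrained("Bia/CORAL", min_pixels=256*28*28, max_pixels=1280*28*28)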

# Prepare Inputs
query = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Find a product of backpack that have the same brand with <Product 1> \n "},
            {
                "type": "image",
                "image": "CORAL/images/product_1.png",
            },
            {"type": "text", "text": "\n Ransel MOSSDOOM Polyester dengan Ruang Komputer dan Penyimpanan Besar, Ukuran $30 \times 12 \times 38$ cm , Berat 0.32 kg. </Product 1> and the same fashion style with <Product 2> "},
            {
                "type": "image",
                "image": "CORAL/images/product_2.png",
            },
            {"type": "text", "text": "\n Elegant Pink Flats with Low Heel and Buckle Closure for Stylish Party Wear </Product 2> with a quilted texture and a chain strap."}
            ],
    }
]

candidate = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Represent the given product: "},
            {
                "type": "image",
                "image": "CORAL/images/product_3.png",
            },
            {"type": "text", "text": "\n MOSSDOOM Elegant Pink PU Leather Handbag with Chain Strap and Large Capacity, Compact Size $18 \times 9.5 \times 15 \mathrm{~cm}$."},
        ],
    }
]

query_text = processor.apply_chat_template(
    query, tokenize=False, add_generation_prompt=True
)

candidate_text = processor.apply_chat_template(
    candidate, tokenize=False, add_generation_prompt=True
)

query_image_inputs, query_video_inputs = process_vision_info(query)

candidate_image_inputs, candidate_video_inputs = process_vision_info(candidate)

query_inputs = processor(
    text=[query_text],
    images=query_image_inputs,
    videos=query_video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

candidate_inputs = processor(
    text=[candidate_text],
    images=candidate_image_inputs,
    videos=candidate_video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")


# Encode Embeddings
with torch.inference_mode():
    query_outputs = model(**query_inputs, return_dict=True, output_hidden_states=True)
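    # Take the final-layer hidden state of the last token as the embedding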
    query_embedding = query_outputs.hidden_states[-1][:,-1,:]
    query_embedding = torch.nn.functional.normalize(query_embedding, dim=-1)
    print(query_embedding.shape)  # torch.Size([1, 2048])

    candidate_outputs = model(**candidate_inputs, return_dict=True, output_hidden_states=True)
    candidate_embedding = candidate_outputs.hidden_states[-1][:,-1,:]
    candidate_embedding = torch.nn.functional.normalize(candidate_embedding, dim=-1)
    print(candidate_embedding.shape)  # torch.Size([1, 2048])

# Compute Similarity
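# Since both embeddings are L2-normalized, the dot product below equals cosine similarity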
similarity = torch.matmul(query_embedding, candidate_embedding.T)
print(similarity)  # tensor([[0.6992]], device='cuda:0', dtype=torch.bfloat16)
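
In practice a query is usually scored against many candidates at once. The sketch below is a minimal, hypothetical extension of the snippet above (the rank_candidates helper is not part of the released code): it assumes each candidate has been encoded and L2-normalized as shown, stacks the embeddings, and ranks them by cosine similarity.

import torch

def rank_candidates(query_embedding, candidate_embeddings):
    # query_embedding: [1, hidden_dim]; candidate_embeddings: [num_candidates, hidden_dim]
    # Both are assumed to be L2-normalized, so the dot product is cosine similarity.
    scores = query_embedding @ candidate_embeddings.T          # [1, num_candidates]
    return scores, torch.argsort(scores, dim=-1, descending=True)

# Example usage (embeddings encoded as in the snippet above):
# candidate_embeddings = torch.cat([candidate_embedding_1, candidate_embedding_2], dim=0)
# scores, order = rank_candidates(query_embedding, candidate_embeddings)
# print(order[0])  # candidate indices from most to least similar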

Evaluation

We report the experimental results of CORAL on the MERIT dataset below.

Figure: CORAL structure.

Figure: Performance of CORAL on MERIT.

Citation

Chow W, Gao Y, Li L, et al. MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query[J]. arXiv preprint arXiv:2506.03144, 2025.

BibTeX:

@article{chow2025merit,
  title={MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query},
  author={Chow, Wei and Gao, Yuan and Li, Linfeng and Wang, Xian and Xu, Qi and Song, Hang and Kong, Lingdong and Zhou, Ran and Zeng, Yi and Cai, Yidong and others},
  journal={arXiv preprint arXiv:2506.03144},
  year={2025}
}