Qwen2.5-VL-3B-R2R-low-level
Qwen2.5-VL-3B-R2R-low-level is a Vision-and-Language Navigation (VLN) model fine-tuned from Qwen2.5-VL-3B-Instruct on the Room-to-Room (R2R) dataset using the Matterport3D (MP3D) simulator. The model is trained using a low-level action space, where it perceives the environment through egocentric RGB images at a resolution of 320x240.
Only the LLM component is fine-tuned — the vision encoder and cross-modal projector are kept frozen.
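A minimal sketch of what that freezing could look like with the Hugging Face implementation. Selecting parameters by the `visual` name prefix is an assumption about the module layout (in current `transformers` releases the vision transformer and its merger/projector both live under that prefix); it is not code from this repo.

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

# Load the base model and freeze the vision side. Assumption: all vision-encoder
# and projector parameters carry "visual" in their name, so everything else
# (the LLM decoder) stays trainable.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype=torch.bfloat16
)
for name, param in model.named_parameters():
    param.requires_grad = "visual" not in name  # train the LLM, freeze vision encoder + projector

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable / 1e6:.0f}M")
```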
🧠 Model Summary
- Base Model: Qwen2.5-VL-3B-Instruct
- Dataset: Room-to-Room (R2R) via the Matterport3D simulator.
- Image Resolution: 320x240.
- Action Space:
  - Move: Move to the adjacent node closest to the center of the field of view.
  - Left: Turn 30° to the left.
  - Right: Turn 30° to the right.
  - Stop: Select when the agent believes it has reached the goal.
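To make the action semantics concrete, here is a rough sketch of how the predicted action strings could be mapped onto low-level simulator controls. The `sim` interface below is hypothetical (it is not the MatterSim API); only the 30° turn increment and the move/stop semantics come from the list above.

```python
import math

TURN_ANGLE = math.radians(30)  # Left/Right rotate the agent by 30 degrees


def apply_action(sim, action: str) -> bool:
    """Apply one predicted action; returns False once the agent stops."""
    if action == "Left":
        sim.rotate(-TURN_ANGLE)                  # turn 30 degrees to the left
    elif action == "Right":
        sim.rotate(TURN_ANGLE)                   # turn 30 degrees to the right
    elif action == "Move":
        sim.move_to(sim.closest_node_in_view())  # adjacent node closest to the view center
    elif action == "Stop":
        return False                             # agent believes it has reached the goal
    return True
```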
🧪 Training Setup
- Frozen Modules: Vision encoder and cross-modal projector
- Fine-Tuned Module: LLM decoder (Qwen2.5)
- Optimizer: AdamW
- Batch Size: 1 (with gradient accumulation over each episode)
- Learning Rate: 1e-5
- Weight Decay: 0.1
- Precision: bfloat16
- LR Scheduler: Linear scheduler with warmup (first 10% of steps)
- Hardware: Trained on a single NVIDIA A100 80GB GPU
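A sketch of the optimizer and schedule implied by the hyperparameters above. The total number of training steps is a placeholder, since it is not stated in this card.

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, get_linear_schedule_with_warmup

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype=torch.bfloat16
)
num_training_steps = 10_000  # placeholder; the real step count is not given

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),  # only the unfrozen LLM parameters
    lr=1e-5,
    weight_decay=0.1,
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # linear warmup over the first 10% of steps
    num_training_steps=num_training_steps,
)
```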
Training was done using supervised learning for next-action prediction. The model was conditioned at each step with a system prompt, egocentric RGB image observations (320×240), and cumulative episode history (images + actions). The model was trained offline (not in the MP3D simulator) using teacher-forcing on a preprocessed R2R dataset.
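As a rough illustration, one teacher-forced training step could look like the following. The label-masking scheme (loss computed only on the appended ground-truth action tokens) is an assumption; the card only states that training was offline supervised next-action prediction.

```python
import torch


def training_step(model, processor, prompt_text, images, gt_action):
    # Encode the full prompt (system prompt, instruction, image history, current image).
    prompt = processor(text=prompt_text, images=[images], return_tensors="pt").to(model.device)

    # Teacher forcing: append the ground-truth action as the assistant turn.
    action_ids = processor.tokenizer(
        f"<|im_start|>assistant\nAction: {gt_action}<|im_end|>", return_tensors="pt"
    ).input_ids.to(model.device)
    input_ids = torch.cat([prompt["input_ids"], action_ids], dim=1)

    # Supervise only the appended action tokens (assumed masking scheme).
    labels = input_ids.clone()
    labels[:, : prompt["input_ids"].shape[1]] = -100

    outputs = model(
        input_ids=input_ids,
        attention_mask=torch.ones_like(input_ids),
        pixel_values=prompt["pixel_values"],
        image_grid_thw=prompt["image_grid_thw"],
        labels=labels,
    )
    outputs.loss.backward()  # gradients are accumulated over the episode before an optimizer step
    return outputs.loss.item()
```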
📦 Usage
```python
import os

import torch
from torch.utils.data import Dataset, DataLoader
from datasets import Dataset as DT
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
class CustomDataset(Dataset):
    def __init__(self, data):
        self.text = data["text"]
        self.images = data["images"]

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        return self.text[index], self.images[index]


class CollateFunctor:
    # Batch size is always 1, so no padding or max-length handling is needed.
    def __init__(self, processor, width, height):
        self.processor = processor
        self.width = width
        self.height = height

    def __call__(self, batch):
        text, images = batch[0]
        # The assistant turn is started manually so the model only has to fill in the action.
        label_start = self.processor.tokenizer("<|im_start|>assistant\nAction: ", return_tensors="pt").input_ids
        images = [Image.open(img).resize((self.width, self.height), Image.Resampling.LANCZOS) for img in images]
        processed = self.processor(text=text, images=[images], return_tensors="pt")
        prompt_input_ids = processed["input_ids"]
        input_ids = torch.cat([prompt_input_ids, label_start], dim=1)
        attention_mask = torch.ones(1, input_ids.shape[1])
        processed["input_ids"] = input_ids
        processed["attention_mask"] = attention_mask
        return processed
def format_prompt(images_path, step_id, route_instruction, distance_traveled, previous_actions, move_possible, processor, system_prompt):
    # Collect the episode images (step_0.png, step_1.png, ...) in step order; the last one is the current observation.
    images = os.listdir(images_path)
    images = [os.path.join(images_path, img) for img in images]
    images = sorted(images, key=lambda x: int(x.split("_")[-1].split(".")[0]))
    current_image = images.pop(-1)

    # NOTE: the prompt strings below are kept verbatim (including their typos), since the model expects this exact input format.
    content = [
        {
            "type": "text",
            "text": f"Route Instruction: {route_instruction}\nCurrent Step: {step_id}\nCummulative Distance Traveled: {distance_traveled}\nImages from Previous Steps: ",
        },
    ]
    for img in images:
        content.append({"type": "image", "image": img})
    if len(images) == 0:
        content[0]["text"] += "[]"
    content.append(
        {
            "type": "text",
            "text": f"\nActions performed at Previous Steps: {previous_actions}\nCurrent image:",
        }
    )
    content.append(
        {
            "type": "image",
            "image": current_image,
        }
    )
    if move_possible:
        possible_actions = ["Left", "Right", "Move", "Stop"]
    else:
        possible_actions = ["Left", "Right", "Stop"]
    content.append(
        {
            "type": "text",
            "text": f"\nPossible actions: {possible_actions}\nNow predict the next action based on the input you have recived. Answer on the format: Action: (an the action you choose)",
        }
    )
    messages = [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
        {"role": "user", "content": content},
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
    images.append(current_image)

    formatted_sample = {"text": text, "images": images}
    return DT.from_list([formatted_sample])
# Load model and processor
processor = AutoProcessor.from_pretrained("Vebbern/Qwen2.5-VL-3B-R2R-low-level")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Vebbern/Qwen2.5-VL-3B-R2R-low-level",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda",
)

# Use the training resolution (320x240); a higher resolution may still work since the vision encoder was kept frozen.
collate_fn = CollateFunctor(processor, 320, 240)

# Load the required system prompt (included in the model repository).
with open("system_prompt.txt", "r") as f:
    system_prompt = f.read()

path_id = 1021  # id of the R2R path
route_instruction = "Turn around and keep walking on the hallway across the first doorway and wait at the top of some stairs. "
images_path = f"./images/{path_id}"  # directory with the episode's images, named step_0.png, step_1.png, ...
step_id = 2
distance = 8.223
previous_actions = ["Left", "Move"]
move_possible = True  # set to False if there are no navigable nodes within the field of view

# format_prompt loads every image in the directory, from step 0 up to the current step.
prompt = format_prompt(images_path, step_id, route_instruction, distance, previous_actions, move_possible, processor, system_prompt)
dataset = CustomDataset(prompt)
data_loader = DataLoader(
    dataset,
    batch_size=1,
    collate_fn=collate_fn,
)
# Run inference
for batch in data_loader:
    batch = batch.to("cuda")
    with torch.no_grad():
        outputs = model(**batch)
    # The last position predicts the token that follows "Action: ", i.e. the chosen action.
    argmax = torch.argmax(outputs.logits, dim=2)[0]
    model_prediction = processor.decode(argmax[-1])
    print(f"Predicted action: {model_prediction}")
```
⚠️ Apologies for the rough code; the goal here is to show how the system prompt and the other inputs must be structured for inference. The system prompt file is included in this repo.
📊 Evaluation Results
The model was evaluated with the Matterport3D simulator on the standard Room-to-Room (R2R) validation (seen and unseen) and test splits. Performance is reported using the standard VLN metrics.
| Metric | Val Seen | Val Unseen | Test |
|---|---|---|---|
| Path Length (m, ↓) | 10.27 | 10.50 | 10.59 |
| Navigation Error (m, ↓) | 7.14 | 7.84 | 7.99 |
| Oracle Success Rate (↑) | 41% | 34% | 34% |
| Success Rate (↑) | 35% | 27% | 26% |
| SPL (↑) | 32% | 24% | 24% |
🧾 Metric Definitions
- Path Length: Average length of the agent's trajectory, in meters.
- Navigation Error: Mean distance (in meters) between the goal and the position where the agent stops.
- Success Rate: Percentage of episodes in which the agent stops within 3 meters of the goal.
- SPL (Success weighted by Path Length): Success rate weighted by path efficiency; penalizes long or inefficient paths.
- Oracle Success Rate: Success rate if the agent had stopped at the point along its path closest to the goal.
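For reference, these metrics can be computed from per-episode statistics roughly as follows. This is a sketch using the 3 m success radius from the definitions above; the field names are illustrative, not taken from an actual evaluation script.

```python
SUCCESS_RADIUS = 3.0  # meters


def vln_metrics(episodes):
    """Each episode record is assumed to hold: final distance to goal, closest
    distance to goal along the path, agent path length, and shortest-path
    length from start to goal (all in meters)."""
    n = len(episodes)
    sr = sum(ep["final_dist"] <= SUCCESS_RADIUS for ep in episodes) / n
    osr = sum(ep["closest_dist"] <= SUCCESS_RADIUS for ep in episodes) / n
    spl = sum(
        (ep["final_dist"] <= SUCCESS_RADIUS)
        * ep["shortest_path_len"] / max(ep["path_len"], ep["shortest_path_len"])
        for ep in episodes
    ) / n
    return {"SR": sr, "OSR": osr, "SPL": spl}
```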
📝 Remarks
While this model performs competitively compared to other low-level action space approaches on the R2R task, it still falls significantly short of the state-of-the-art methods that utilize a panoramic action space.
Nonetheless, it provides a useful and interpretable large vision-language model (LVLM) baseline for VLN with a low-level action space.
🔁 Related Models
A panoramic action space equivalent of this model is also available.
- Panoramic Action Space Version: Qwen2.5-VL-3B-R2R-panoramic
🪪 License
This model is licensed under the Apache License 2.0.