Qwen2.5-VL-3B-R2R-low-level
Qwen2.5-VL-3B-R2R-low-level is a Vision-and-Language Navigation (VLN) model fine-tuned from Qwen2.5-VL-3B-Instruct on the Room-to-Room (R2R) dataset using the Matterport3D (MP3D) simulator. The model is trained using a low-level action space, where it perceives the environment through egocentric RGB images at a resolution of 320x240.
Only the LLM component is fine-tuned — the vision encoder and cross-modal projector are kept frozen.
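A minimal sketch of what that freezing could look like with the Hugging Face implementation. Selecting parameters by the `visual` name prefix is an assumption about the module layout (in current `transformers` releases the vision transformer and its merger/projector both live under that prefix); it is not code from this repo.

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

# Load the base model and freeze the vision side. Assumption: all vision-encoder
# and projector parameters carry "visual" in their name, so everything else
# (the LLM decoder) stays trainable.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype=torch.bfloat16
)
for name, param in model.named_parameters():
    param.requires_grad = "visual" not in name  # train the LLM, freeze vision encoder + projector

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable / 1e6:.0f}M")
```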
🧠 Model Summary
- Base Model: Qwen2.5-VL-3B-Instruct
- Dataset: Room-to-Room (R2R) via the Matterport3D simulator.
- Image Resolution: 320x240.
- Action Space:
  - Move: Move to the adjacent node closest to the center of the field of view.
  - Left: Turn 30° to the left.
  - Right: Turn 30° to the right.
  - Stop: Select when the agent believes it has reached the goal.
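To make the action semantics concrete, here is a rough sketch of how the predicted action strings could be mapped onto low-level simulator controls. The `sim` interface below is hypothetical (it is not the MatterSim API); only the 30° turn increment and the move/stop semantics come from the list above.

```python
import math

TURN_ANGLE = math.radians(30)  # Left/Right rotate the agent by 30 degrees


def apply_action(sim, action: str) -> bool:
    """Apply one predicted action; returns False once the agent stops."""
    if action == "Left":
        sim.rotate(-TURN_ANGLE)                  # turn 30 degrees to the left
    elif action == "Right":
        sim.rotate(TURN_ANGLE)                   # turn 30 degrees to the right
    elif action == "Move":
        sim.move_to(sim.closest_node_in_view())  # adjacent node closest to the view center
    elif action == "Stop":
        return False                             # agent believes it has reached the goal
    return True
```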
🧪 Training Setup
- Frozen Modules: Vision encoder and cross-modal projector
- Fine-Tuned Module: LLM decoder (Qwen2.5)
- Optimizer: AdamW
- Batch Size: 1 (with gradient accumulation over each episode)
- Learning Rate: 1e-5
- Weight Decay: 0.1
- Precision: bfloat16
- LR Scheduler: Linear scheduler with warmup (first 10% of steps)
- Hardware: Trained on a single NVIDIA A100 80GB GPU
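A sketch of the optimizer and schedule implied by the hyperparameters above. The total number of training steps is a placeholder, since it is not stated in this card.

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, get_linear_schedule_with_warmup

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype=torch.bfloat16
)
num_training_steps = 10_000  # placeholder; the real step count is not given

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),  # only the unfrozen LLM parameters
    lr=1e-5,
    weight_decay=0.1,
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # linear warmup over the first 10% of steps
    num_training_steps=num_training_steps,
)
```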
Training was done using supervised learning for next-action prediction. The model was conditioned at each step with a system prompt, egocentric RGB image observations (320×240), and cumulative episode history (images + actions). The model was trained offline (not in the MP3D simulator) using teacher-forcing on a preprocessed R2R dataset.
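As a rough illustration, one teacher-forced training step could look like the following. The label-masking scheme (loss computed only on the appended ground-truth action tokens) is an assumption; the card only states that training was offline supervised next-action prediction.

```python
import torch


def training_step(model, processor, prompt_text, images, gt_action):
    # Encode the full prompt (system prompt, instruction, image history, current image).
    prompt = processor(text=prompt_text, images=[images], return_tensors="pt").to(model.device)

    # Teacher forcing: append the ground-truth action as the assistant turn.
    action_ids = processor.tokenizer(
        f"<|im_start|>assistant\nAction: {gt_action}<|im_end|>", return_tensors="pt"
    ).input_ids.to(model.device)
    input_ids = torch.cat([prompt["input_ids"], action_ids], dim=1)

    # Supervise only the appended action tokens (assumed masking scheme).
    labels = input_ids.clone()
    labels[:, : prompt["input_ids"].shape[1]] = -100

    outputs = model(
        input_ids=input_ids,
        attention_mask=torch.ones_like(input_ids),
        pixel_values=prompt["pixel_values"],
        image_grid_thw=prompt["image_grid_thw"],
        labels=labels,
    )
    outputs.loss.backward()  # gradients are accumulated over the episode before an optimizer step
    return outputs.loss.item()
```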
📦 Usage
```python
import os

import torch
from torch.utils.data import Dataset, DataLoader
from datasets import Dataset as DT
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
class CustomDataset(Dataset):
    def __init__(self, data):
        self.text = data["text"]
        self.images = data["images"]

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        return self.text[index], self.images[index]


class CollateFunctor:
    # Batch size is always 1, so no padding or max-length handling is needed.
    def __init__(self, processor, width, height):
        self.processor = processor
        self.width = width
        self.height = height

    def __call__(self, batch):
        text, images = batch[0]
        # The assistant turn is started manually so the model only has to fill in the action.
        label_start = self.processor.tokenizer("<|im_start|>assistant\nAction: ", return_tensors="pt").input_ids
        images = [Image.open(img).resize((self.width, self.height), Image.Resampling.LANCZOS) for img in images]
        processed = self.processor(text=text, images=[images], return_tensors="pt")
        prompt_input_ids = processed["input_ids"]
        input_ids = torch.cat([prompt_input_ids, label_start], dim=1)
        attention_mask = torch.ones(1, input_ids.shape[1])
        processed["input_ids"] = input_ids
        processed["attention_mask"] = attention_mask
        return processed
def format_prompt(images_path, step_id, route_instruction, distance_traveled, previous_actions, move_possible, processor, system_prompt):
    # Collect the episode images (step_0.png, step_1.png, ...) in step order; the last one is the current observation.
    images = os.listdir(images_path)
    images = [os.path.join(images_path, img) for img in images]
    images = sorted(images, key=lambda x: int(x.split("_")[-1].split(".")[0]))
    current_image = images.pop(-1)

    # NOTE: the prompt strings below are kept verbatim (including their typos), since the model expects this exact input format.
    content = [
        {
            "type": "text",
            "text": f"Route Instruction: {route_instruction}\nCurrent Step: {step_id}\nCummulative Distance Traveled: {distance_traveled}\nImages from Previous Steps: ",
        },
    ]
    for img in images:
        content.append({"type": "image", "image": img})
    if len(images) == 0:
        content[0]["text"] += "[]"
    content.append(
        {
            "type": "text",
            "text": f"\nActions performed at Previous Steps: {previous_actions}\nCurrent image:",
        }
    )
    content.append(
        {
            "type": "image",
            "image": current_image,
        }
    )
    if move_possible:
        possible_actions = ["Left", "Right", "Move", "Stop"]
    else:
        possible_actions = ["Left", "Right", "Stop"]
    content.append(
        {
            "type": "text",
            "text": f"\nPossible actions: {possible_actions}\nNow predict the next action based on the input you have recived. Answer on the format: Action: (an the action you choose)",
        }
    )
    messages = [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
        {"role": "user", "content": content},
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
    images.append(current_image)

    formatted_sample = {"text": text, "images": images}
    return DT.from_list([formatted_sample])
# Load model and processor
processor = AutoProcessor.from_pretrained("Vebbern/Qwen2.5-VL-3B-R2R-low-level")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Vebbern/Qwen2.5-VL-3B-R2R-low-level",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda",
)

# Use the training resolution (320x240); a higher resolution may still work since the vision encoder was kept frozen.
collate_fn = CollateFunctor(processor, 320, 240)

# Load the required system prompt (included in the model repository).
with open("system_prompt.txt", "r") as f:
    system_prompt = f.read()

path_id = 1021  # id of the R2R path
route_instruction = "Turn around and keep walking on the hallway across the first doorway and wait at the top of some stairs. "
images_path = f"./images/{path_id}"  # directory with the episode's images, named step_0.png, step_1.png, ...
step_id = 2
distance = 8.223
previous_actions = ["Left", "Move"]
move_possible = True  # set to False if there are no navigable nodes within the field of view

# format_prompt loads every image in the directory, from step 0 up to the current step.
prompt = format_prompt(images_path, step_id, route_instruction, distance, previous_actions, move_possible, processor, system_prompt)
dataset = CustomDataset(prompt)
data_loader = DataLoader(
    dataset,
    batch_size=1,
    collate_fn=collate_fn,
)
# Run inference
for batch in data_loader:
    batch = batch.to("cuda")
    with torch.no_grad():
        outputs = model(**batch)
    # The last position predicts the token that follows "Action: ", i.e. the chosen action.
    argmax = torch.argmax(outputs.logits, dim=2)[0]
    model_prediction = processor.decode(argmax[-1])
    print(f"Predicted action: {model_prediction}")
```
⚠️ Apologies for the rough code; the goal here is to show how the system prompt and the other inputs must be structured for inference. The system prompt file is included in this repo.
📊 Evaluation Results
The model was evaluated with the Matterport3D simulator on the standard Room-to-Room (R2R) validation (seen and unseen) and test splits. Performance is reported using the standard VLN metrics.
| Metric | Val Seen | Val Unseen | Test |
|---|---|---|---|
| Path Length (m, ↓) | 10.27 | 10.50 | 10.59 |
| Navigation Error (m, ↓) | 7.14 | 7.84 | 7.99 |
| Oracle Success Rate (↑) | 41% | 34% | 34% |
| Success Rate (↑) | 35% | 27% | 26% |
| SPL (↑) | 32% | 24% | 24% |
🧾 Metric Definitions
- Path Length: Average length of the agent's trajectory, in meters.
- Navigation Error: Mean distance (in meters) between the goal and the position where the agent stops.
- Success Rate: Percentage of episodes in which the agent stops within 3 meters of the goal.
- SPL (Success weighted by Path Length): Success rate weighted by path efficiency; penalizes long or inefficient paths.
- Oracle Success Rate: Success rate if the agent had stopped at the point along its path closest to the goal.
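For reference, these metrics can be computed from per-episode statistics roughly as follows. This is a sketch using the 3 m success radius from the definitions above; the field names are illustrative, not taken from an actual evaluation script.

```python
SUCCESS_RADIUS = 3.0  # meters


def vln_metrics(episodes):
    """Each episode record is assumed to hold: final distance to goal, closest
    distance to goal along the path, agent path length, and shortest-path
    length from start to goal (all in meters)."""
    n = len(episodes)
    sr = sum(ep["final_dist"] <= SUCCESS_RADIUS for ep in episodes) / n
    osr = sum(ep["closest_dist"] <= SUCCESS_RADIUS for ep in episodes) / n
    spl = sum(
        (ep["final_dist"] <= SUCCESS_RADIUS)
        * ep["shortest_path_len"] / max(ep["path_len"], ep["shortest_path_len"])
        for ep in episodes
    ) / n
    return {"SR": sr, "OSR": osr, "SPL": spl}
```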
📝 Remarks
While this model performs competitively compared to other low-level action space approaches on the R2R task, it still falls significantly short of the state-of-the-art methods that utilize a panoramic action space.
Nonetheless, it provides a useful and interpretable large vision-language model (LVLM) baseline for VLN with a low-level action space.
🔁 Related Models
A panoramic action space equivalent of this model is also available.
- Panoramic Action Space Version: Qwen2.5-VL-3B-R2R-panoramic
🪪 License
This model is licensed under the Apache License 2.0.