Model Details
Perception Language Model (PLM) is a state-of-the-art, fully open and reproducible MLLM for transparent research in image and video understanding. It was introduced in "PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding".
Model Overview: PLM combines a vision encoder with a small-scale (<8B parameters) LLM decoder. We start with an analysis of standard training pipelines using available data, without any proprietary model distillation. We then investigate large-scale synthetic data and establish key scaling laws to identify the critical data gaps that limit video understanding performance, especially for spatio-temporal reasoning and fine-grained understanding tasks. To fill these gaps, we create 2.8M high-quality, human-labeled fine-grained video question-answer pairs and spatio-temporally grounded video captions. This release is nearly an order of magnitude larger than the largest existing video datasets.
We provide training and evaluation code in the perception_models codebase; you can find more details in the GitHub repo.
| Resource | Description | Documentation |
|---|---|---|
| Evaluation | Evaluation of PLM using lmms-eval | docs/evaluation.md |
| Training / Finetuning | Training and finetuning instructions for PLM | docs/training.md |
| PLM-VideoBench | Evaluation on PLM-VideoBench using lmms-eval | docs/plm_videobench.md |
| End-to-End Finetuning Example | End-to-end finetuning example on radiology images | docs/finetune_example.md |
| Generating Response | Generate responses using a trained model with generate.py | generate.py |
PLM Image Benchmark Results
| Model | DocVQA | ChartQA | TextVQA | InfoQA | AI2D | OCRBench | COCO | Nocap | Flickr | MMMU | VQAv2 | OKVQA | VizWiz | MME | SEED | BLINK | CVBench | RealWorldQA | VSR | POPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PLM1B | 90.7 | 78.6 | 82.1 | 63.0 | 84.9 | 807 | 138.6 | 124.2 | 100.5 | 34.8 | 81.7 | 61.0 | 59.7 | 1603 | 76.3 | 46.8 | 73.8 | 67.1 | 68.8 | 88.4 |
| PLM3B | 93.8 | 84.3 | 84.3 | 74.6 | 90.9 | 830 | 144.9 | 126.5 | 98.0 | 41.2 | 84.3 | 66.8 | 64.0 | 1879 | 78.5 | 55.4 | 81.4 | 72.4 | 80.4 | 88.7 |
| PLM8B | 94.6 | 85.5 | 86.5 | 80.9 | 92.7 | 870 | 146.7 | 129.9 | 105.6 | 46.1 | 85.6 | 69.6 | 67.0 | 1989 | 79.3 | 56.0 | 81.3 | 75.0 | 82.8 | 89.9 |
PLM Video Benchmark Results
| Model | VATEX | DREAM 1K | How2QA | MVBench | NExTQA | PerceptionTest (test) | STAR | TVQA | VideoMME | TVBench | ActivityNetQA | EgoSchema (test) | TemporalBench | TOMATO | MotionBench (dev) | TempCompass (MCQ) | CGBench (clue) | Charades STA | VideoHallucer | EventHallusion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PLM1B | 92.5 | 34.3 | 86.4 | 70.1 | 80.3 | 72.7 | 83.7 | 50.3 | 49.2 | 50.4 | 62.5 | 60.4 | 18.2 | 25.5 | 52.2 | 64.6 | 43.6 | 55.2 | 49.2 | 79.5 |
| PLM3B | 96.1 | 37.4 | 89.4 | 74.7 | 83.4 | 79.3 | 84.8 | 55.3 | 54.9 | 58.9 | 66.2 | 66.9 | 23.4 | 30.9 | 60.4 | 69.3 | 47.2 | 57.7 | 55.5 | 76.5 |
| PLM8B | 99.7 | 35.9 | 90.7 | 77.1 | 84.1 | 82.7 | 84.9 | 59.3 | 58.3 | 63.5 | 67.3 | 68.8 | 28.3 | 33.2 | 61.4 | 72.7 | 46.4 | 58.6 | 57.7 | 77.3 |
PLM Usage
The files in this repository work with the Hugging Face transformers library, while the original checkpoints that work with the perception_models codebase are under original/.
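If you only need the original perception_models checkpoints, one way to fetch just that folder is huggingface_hub's standard snapshot_download with a file pattern. This is a minimal sketch; the repo_id matches the examples below and the pattern simply mirrors the original/ folder mentioned above:

```python
from huggingface_hub import snapshot_download

# Download only the original/ checkpoints (skips the transformers-format weights).
local_dir = snapshot_download(
    repo_id="facebook/Perception-LM-3B",
    allow_patterns=["original/*"],
)
print(local_dir)
```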
To use the latest transformers library, install it from source:

```shell
pip install --upgrade git+https://github.com/huggingface/transformers.git
```
PLM example with image input:

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
from huggingface_hub import hf_hub_download

MODEL_PATH = "facebook/Perception-LM-3B"
processor = AutoProcessor.from_pretrained(MODEL_PATH, use_fast=True)
model = AutoModelForImageTextToText.from_pretrained(MODEL_PATH).to("cuda")

# Download a test image from the Hub.
test_image_file = hf_hub_download(
    repo_id="shumingh/perception_lm_test_images",
    filename="14496_0.PNG",
    repo_type="dataset",
)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": test_image_file,
            },
            {"type": "text", "text": "Describe the bar plot in the image."},
        ],
    }
]

inputs = processor.apply_chat_template(
    [conversation],
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)
generate_ids = model.generate(**inputs, max_new_tokens=256)

# Strip the prompt tokens so only the newly generated text is decoded.
input_length = inputs["input_ids"].shape[1]
generate_ids_without_inputs = generate_ids[:, input_length:]

for output in processor.batch_decode(generate_ids_without_inputs, skip_special_tokens=True):
    print(output)
```
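The example above loads the model in full precision on a single GPU. If memory is tight, a lighter-weight variant is to load the weights in bfloat16 and let transformers place them automatically; this is a sketch using standard from_pretrained arguments rather than anything PLM-specific, and the rest of the example is unchanged:

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_PATH = "facebook/Perception-LM-3B"
processor = AutoProcessor.from_pretrained(MODEL_PATH, use_fast=True)
# bfloat16 roughly halves the memory footprint; device_map="auto" spreads the
# weights across the available GPUs.
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```

Generation in bfloat16 should be close to, but not bit-identical with, the full-precision output.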
PLM example with video input:

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
from huggingface_hub import hf_hub_download

MODEL_PATH = "facebook/Perception-LM-3B"
processor = AutoProcessor.from_pretrained(MODEL_PATH, use_fast=True)
model = AutoModelForImageTextToText.from_pretrained(MODEL_PATH).to("cuda")

# Download a test video from the Hub.
video_file = hf_hub_download(
    repo_id="shumingh/perception_lm_test_videos",
    filename="GUWR5TyiY-M_000012_000022.mp4",
    repo_type="dataset",
)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "url": video_file,
            },
            {"type": "text", "text": "Can you describe the video in detail?"},
        ],
    }
]

inputs = processor.apply_chat_template(
    [conversation],
    num_frames=32,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    video_load_backend="decord",
)
inputs = inputs.to(model.device)
generate_ids = model.generate(**inputs, max_new_tokens=256)

# Strip the prompt tokens so only the newly generated text is decoded.
input_length = inputs["input_ids"].shape[1]
generate_ids_without_inputs = generate_ids[:, input_length:]

for output in processor.batch_decode(generate_ids_without_inputs, skip_special_tokens=True):
    print(output)
```
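The video example decodes frames with decord (pip install decord). If decord is not available in your environment, transformers supports other generic video backends such as pyav (pip install av); only the video_load_backend argument changes. This sketch reuses conversation and processor from the example above:

```python
# Same call as above, but decoding the video with the pyav backend instead of decord.
inputs = processor.apply_chat_template(
    [conversation],
    num_frames=32,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    video_load_backend="pyav",
)
```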
Simple PLM Fine-tuning Example

```python
import torch
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
from transformers import AutoProcessor, BitsAndBytesConfig, AutoModelForImageTextToText
from transformers import TrainingArguments, Trainer
from datasets import load_dataset
from torch.nn.utils.rnn import pad_sequence
USE_LORA = False
USE_QLORA = False
model_id = "facebook/Perception-LM-3B"
processor = AutoProcessor.from_pretrained(model_id)
# Default is "thumb+tile" which corresponds to a max of 36 tiles
# We use "vanilla" which corresponds to a single tile here
# 1) to save memory and 2) to simplify demo code
processor.image_processor.vision_input_type = "vanilla"

if USE_QLORA or USE_LORA:
    lora_config = LoraConfig(
        r=8,
        lora_alpha=8,
        lora_dropout=0.1,
        target_modules=[
            "down_proj",
            "o_proj",
            "k_proj",
            "q_proj",
            "gate_proj",
            "up_proj",
            "v_proj",
        ],
        use_dora=False if USE_QLORA else True,
        init_lora_weights="gaussian",
    )
    lora_config.inference_mode = False
    if USE_QLORA:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
    model = AutoModelForImageTextToText.from_pretrained(
        model_id,
        quantization_config=bnb_config if USE_QLORA else None,
        device_map="auto",
    )
    model.add_adapter(lora_config)
    model.enable_adapters()
    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, lora_config)
    print(model.get_nb_trainable_parameters())
else:
    model = AutoModelForImageTextToText.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
    ).to("cuda")

    # if you'd like to only fine-tune the LLM, freeze the vision tower
    for param in model.model.vision_tower.parameters():
        param.requires_grad = False

peak_mem = torch.cuda.max_memory_allocated()
print(f"The model as loaded is holding: {peak_mem / 1024**3:.2f} GB of GPU RAM")
ds = load_dataset("unsloth/Radiology_mini")
split_ds = ds["train"].train_test_split(test_size=0.1)
train_ds = split_ds["train"]
print(f"prompt: {train_ds[0]['caption']}, image_id: {train_ds[0]['image_id']}")
image_token_id = processor.tokenizer.image_token_id

def collate_fn(examples):
    instances = []
    for example in examples:
        caption = example["caption"]
        user_content = [
            {
                "type": "text",
                "text": "You are an expert radiographer. Describe accurately what you see in this image.",
            }
        ]
        user_content.append({"type": "image", "image": example["image"]})
        messages = [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": [{"type": "text", "text": f"{caption}"}]},
        ]
        instance = (
            processor.apply_chat_template(
                messages,
                add_generation_prompt=True,
                tokenize=True,
                return_dict=True,
                return_tensors="pt",
            )
            .to("cuda")
            .to(model.dtype)
        )
        instances.append(instance)

    input_ids = pad_sequence(
        [inst["input_ids"].squeeze(0) for inst in instances],
        batch_first=True,
        padding_value=processor.tokenizer.pad_token_id,
    )
    attention_mask = pad_sequence(
        [inst["attention_mask"].squeeze(0) for inst in instances],
        batch_first=True,
        padding_value=0,
    )
    labels = pad_sequence(
        [inst["input_ids"].squeeze(0).clone() for inst in instances],
        batch_first=True,
        padding_value=-100,
    )
    pixel_values = torch.cat(
        [inst["pixel_values"] for inst in instances], dim=0
    )

    # Mask out image tokens so the loss is computed only on text tokens.
    labels[labels == image_token_id] = -100

    out = {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
        "pixel_values": pixel_values,
    }
    return out
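
# Optional sanity check of the collator (illustrative only, not part of training):
# batch = collate_fn([train_ds[0], train_ds[1]])
# print({k: v.shape for k, v in batch.items()})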
model_name = model_id.split("/")[-1]

training_args = TrainingArguments(
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    warmup_steps=50,
    learning_rate=1e-4,
    weight_decay=0.01,
    logging_steps=25,
    save_strategy="steps",
    save_steps=250,
    save_total_limit=1,
    optim="adamw_torch",  # for 8-bit, keep paged_adamw_8bit, else adamw_hf
    bf16=True,
    output_dir=f"./{model_name}-radiology-mini",
    hub_model_id=f"{model_name}-radiology-mini",
    remove_unused_columns=False,
    report_to="tensorboard",
    dataloader_pin_memory=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collate_fn,
    train_dataset=train_ds,
)

trainer.train()
```
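After training, you can spot-check the fine-tuned model on a held-out sample. This is a minimal sketch that assumes the non-LoRA path above and reuses model, processor, and split_ds from the fine-tuning script; the prompt string simply mirrors the one used during training:

```python
model.eval()
sample = split_ds["test"][0]

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "You are an expert radiographer. Describe accurately what you see in this image.",
            },
            {"type": "image", "image": sample["image"]},
        ],
    }
]

inputs = (
    processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    )
    .to(model.device)
    .to(model.dtype)  # cast pixel_values to the model dtype, as in collate_fn
)

with torch.no_grad():
    generate_ids = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens.
new_tokens = generate_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```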
Citation
If you find our code useful for your research, please consider citing:
```bibtex
@article{cho2025PerceptionLM,
title={PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding},
author={Jang Hyun Cho and Andrea Madotto and Effrosyni Mavroudi and Triantafyllos Afouras and Tushar Nagarajan and Muhammad Maaz and Yale Song and Tengyu Ma and Shuming Hu and Hanoona Rasheed and Peize Sun and Po-Yao Huang and Daniel Bolya and Suyog Jain and Miguel Martin and Huiyu Wang and Nikhila Ravi and Shashank Jain and Temmy Stark and Shane Moon and Babak Damavandi and Vivian Lee and Andrew Westbury and Salman Khan and Philipp Kr\"{a}henb\"{u}hl and Piotr Doll{\'a}r and Lorenzo Torresani and Kristen Grauman and Christoph Feichtenhofer},
journal={arXiv},
year={2025}
}
@article{bolya2025PerceptionEncoder,
title={Perception Encoder: The best visual embeddings are not at the output of the network},
author={Daniel Bolya and Po-Yao Huang and Peize Sun and Jang Hyun Cho and Andrea Madotto and Chen Wei and Tengyu Ma and Jiale Zhi and Jathushan Rajasegaran and Hanoona Rasheed and Junke Wang and Marco Monteiro and Hu Xu and Shiyu Dong and Nikhila Ravi and Daniel Li and Piotr Doll{\'a}r and Christoph Feichtenhofer},
journal={arXiv},
year={2025}
}
```