---
base_model: Qwen/Qwen2.5-VL-7B-Instruct
library_name: transformers
license: other
tags:
- llama-factory
- full
- generated_from_trainer
pipeline_tag: video-text-to-text
model-index:
- name: bal_imb_cap_full_lr2e-4_epoch10.0_freezevisTrue_fps8
  results: []
---
<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->
## Model description
This model is a fine-tuned version of [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct), trained on what is currently the highest-quality publicly available camera motion dataset. This preview model is the current SOTA for camera motion classification and for video-text retrieval with camera motion captions using [VQAScore](https://arxiv.org/pdf/2404.01291). For more information about our work, see the GitHub page for [CameraBench](https://github.com/sy77777en/CameraBench). *More updates to the benchmark and models will come in the future. Stay tuned!*
## Intended uses & limitations
Usage is identical to that of a [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL) model. Our model is primarily useful for camera motion classification in videos and for video-text retrieval (current SOTA in both tasks).
**A quick demo is shown below:**
<details>
<summary>Generative Scoring (for classification and retrieval):</summary>
There are two ways to use our model for this application. The first, which we recommend, is the `t2v_metrics` approach. The second is a backup approach that directly uses Qwen2.5-VL's inference demo.
1. `t2v_metrics` Approach (recommended)
```python
# Install the package using: pip install git+https://github.com/chancharikmitra/t2v_metrics.git
import t2v_metrics
### For a single (video, text) pair:
qwen_score = t2v_metrics.VQAScore(model='qwen2.5-vl-7b', checkpoint='chancharikm/qwen2.5-vl-7b-cam-motion')
video = "videos/baby.mp4" # a video path in string format
text = "a baby crying"
# Calculate probability of "Yes" response
score = qwen_score(images=[video], texts=[text])
```
For more details, please refer to the t2v_metrics [fork](https://github.com/chancharikmitra/t2v_metrics.git).
2. Qwen2.5-VL Inference Code Approach
```python
# Import necessary libraries
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
# Load the model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "chancharikm/qwen2.5-vl-7b-cam-motion", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
# Prepare input data
video_path = "file:///path/to/video1.mp4"
text_description = "the camera tilting upward"
question = f"Does this video show \"{text_description}\"?"
# Format the input for the model
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": video_path,
                "fps": 8.0,  # Recommended FPS for optimal inference
            },
            {"type": "text", "text": question},
        ],
    }
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs
)
inputs = inputs.to("cuda")
# Generate with score output
with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1,
        do_sample=False,  # Use greedy decoding to get reliable logprobs
        output_scores=True,
        return_dict_in_generate=True
    )
# Calculate probability of "Yes" response
scores = outputs.scores[0]
probs = torch.nn.functional.softmax(scores, dim=-1)
yes_token_id = processor.tokenizer.encode("Yes")[0]
score = probs[0, yes_token_id].item()
print(f"Video: {video_path}")
print(f"Description: '{text_description}'")
print(f"Score: {score:.4f}")
```
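
The scoring step above reduces to a softmax over the first generated token's logits, and both classification and retrieval then amount to sorting those per-pair scores. The following is a minimal sketch of that logic using only the standard library. The helper names (`yes_probability`, `classify`, `retrieve`), the toy two-entry logits, and the clip file names are hypothetical stand-ins for illustration; they are not part of `t2v_metrics` or Qwen2.5-VL.

```python
import math
from typing import Callable, List, Sequence, Tuple

def yes_probability(logits: Sequence[float], yes_index: int) -> float:
    """Softmax over the logits, returning the probability at yes_index."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[yes_index] / sum(exps)

def classify(score_fn: Callable[[str, str], float], video: str,
             labels: Sequence[str]) -> List[Tuple[str, float]]:
    """Score one video against every candidate label, best match first."""
    scored = [(label, score_fn(video, label)) for label in labels]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def retrieve(score_fn: Callable[[str, str], float], videos: Sequence[str],
             query: str) -> List[Tuple[str, float]]:
    """Score every video against one text query, best match first."""
    scored = [(video, score_fn(video, query)) for video in videos]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical stand-in for the model: toy logits laid out as [no, yes].
TOY_LOGITS = {
    ("clip_a.mp4", "the camera tilting upward"): [0.2, 2.5],
    ("clip_a.mp4", "the camera panning left"): [1.8, -0.4],
    ("clip_b.mp4", "the camera tilting upward"): [1.1, 0.3],
}

def toy_score(video: str, text: str) -> float:
    return yes_probability(TOY_LOGITS[(video, text)], yes_index=1)

ranked_labels = classify(
    toy_score, "clip_a.mp4",
    ["the camera panning left", "the camera tilting upward"],
)
ranked_videos = retrieve(
    toy_score, ["clip_b.mp4", "clip_a.mp4"], "the camera tilting upward"
)
print(ranked_labels[0][0])  # highest-scoring label for clip_a.mp4
print(ranked_videos[0][0])  # highest-scoring video for the query
```

In practice, `toy_score` would be replaced by a call into either of the two approaches above, returning the "Yes" probability for a (video, text) pair.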
</details>
<details>
<summary>Natural Language Generation</summary>
There are two ways to use our model for this application. The first, which we recommend, is the `t2v_metrics` approach. The second is a backup approach that directly uses Qwen2.5-VL's inference demo.
1. `t2v_metrics` Approach (recommended)
```python
# Install the package using: pip install git+https://github.com/chancharikmitra/t2v_metrics.git
import t2v_metrics
### For a single (video, text) pair:
qwen_score = t2v_metrics.VQAScore(model='qwen2.5-vl-7b', checkpoint='chancharikm/qwen2.5-vl-7b-cam-motion')
video = "videos/baby.mp4" # a video path in string format
text = "Please describe this image: "
# Generate a natural-language response to the prompt
output = qwen_score.model.generate(images=[video], texts=[text])
```
For more details, please refer to the t2v_metrics [fork](https://github.com/chancharikmitra/t2v_metrics.git).
2. Qwen2.5-VL Inference Code Approach
```python
# The model is trained at 8.0 FPS, which we recommend for optimal inference
import torch  # needed for torch.bfloat16 in the optional flash_attention_2 setup below
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "chancharikm/qwen2.5-vl-7b-cam-motion", torch_dtype="auto", device_map="auto"
)
# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "chancharikm/qwen2.5-vl-7b-cam-motion",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )
# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "fps": 8.0,
            },
            {"type": "text", "text": "Describe the camera motion in this video."},
        ],
    }
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,  # carries the fps information returned by process_vision_info
)
inputs = inputs.to("cuda")
# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
</details>
## Training and evaluation data
Training and evaluation data can be found in our [repo](https://github.com/sy77777en/CameraBench).
## Training procedure
We use the LLaMA-Factory codebase to fine-tune our model. To replicate our work, use the data above and the hyperparameters below.
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 4
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 8
- total_train_batch_size: 256
- total_eval_batch_size: 8
- optimizer: adamw_torch with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 10.0
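
As a sanity check on the list above, the total train batch size follows from the per-device batch size, device count, and gradient accumulation steps, and the learning rate follows a linear warmup and then a cosine decay. The sketch below reproduces that arithmetic with the standard library only. It assumes the usual cosine-with-warmup shape (linear ramp over the first 10% of steps, then half-cosine decay to zero); the `lr_at` helper and the total step count of 1000 are hypothetical and chosen only for illustration.

```python
import math

# Values copied from the hyperparameter list above.
per_device_batch = 4   # train_batch_size
num_devices = 8
grad_accum_steps = 8
base_lr = 1e-5         # learning_rate
warmup_ratio = 0.1     # lr_scheduler_warmup_ratio

# Effective number of samples per optimizer step.
effective_batch = per_device_batch * num_devices * grad_accum_steps
print(effective_batch)

def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at a given optimizer step under cosine decay with linear warmup."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Half-cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

total_steps = 1000  # hypothetical; the real value depends on dataset size
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step, total_steps))
```

The learning rate peaks at the end of warmup and is halved midway through the decay phase, which is useful to know when comparing checkpoints taken at different steps.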
<!-- ### Training results
| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 0.0054 | 2.7191 | 1000 | 0.0100 |
| 0.0005 | 5.4358 | 2000 | 0.0036 |
| 0.0 | 8.1525 | 3000 | 0.0000 |
### Framework versions
- Transformers 4.51.0
- Pytorch 2.6.0+cu124
- Datasets 3.2.0
- Tokenizers 0.21.0 -->
## ✏️ Citation
If you find this repository useful for your research, please cite the following:
```
@article{lin2025camerabench,
title={Towards Understanding Camera Motions in Any Video},
author={Lin, Zhiqiu and Cen, Siyuan and Jiang, Daniel and Karhade, Jay and Wang, Hewei and Mitra, Chancharik and Ling, Tiffany and Huang, Yuhan and Liu, Sifan and Chen, Mingyu and Zawar, Rushikesh and Bai, Xue and Du, Yilun and Gan, Chuang and Ramanan, Deva},
journal={arXiv preprint arXiv:2504.15376},
year={2025},
}
```