Image-Text-to-Text
Transformers
Safetensors
lora
File size: 3,617 Bytes
fe5b940
 
 
 
 
 
 
 
 
 
 
 
 
9a35221
fe5b940
9a35221
06e402e
7c1495b
fe5b940
 
 
 
7c1495b
9a35221
7c1495b
 
 
 
 
 
 
fe5b940
 
 
 
 
 
 
 
 
7c1495b
 
 
 
fe5b940
7c1495b
 
fe5b940
 
 
7c1495b
 
 
 
 
dbb98f9
 
fe5b940
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
dbb98f9
9a35221
fe5b940
9a35221
dbb98f9
fe5b940
 
 
 
 
 
 
dbb98f9
9a35221
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
---
license: mit
base_model:
- mistralai/Pixtral-12B-2409
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- lora
datasets:
- Multimodal-Fatima/FGVC_Aircraft_train
- takara-ai/FloodNet_2021-Track_2_Dataset_HF
---

<img src="https://takara.ai/images/logo-24/TakaraAi.svg" width="200" alt="Takara.ai Logo" />
From the Frontier Research Team at **Takara.ai** we present a specialized LoRA adapter for aerial imagery analysis and visual question answering.

---

# pixtral_aerial_VQA_adapter

## Overview
This repository contains a fine-tuned LoRA adapter for the Pixtral-12B model, optimized specifically for aerial imagery analysis and visual question answering. The adapter enables detailed processing of aerial footage with a focus on construction site surveying, structural assessment, and environmental monitoring.

## Model Details
- **Type**: LoRA Adapter
- **Total Parameters**: 6,225,920
- **Memory Usage**: 23.75 MB
- **Precisions**: torch.float32
- **Layer Types**:
  - lora_A: 40
  - lora_B: 40
- **Base Model**: [mistralai/Pixtral-12B-2409](https://huggingface.co/mistralai/Pixtral-12B-2409)

## Capabilities
The adapter enhances Pixtral's ability to:
- Identify and describe construction elements in aerial imagery
- Detect structural issues in buildings and infrastructure
- Analyze progress in construction projects
- Monitor environmental changes and flooding events
- Process high-resolution aerial imagery with improved detail recognition

## Intended Use
- **Primary intended uses**: Processing aerial footage of construction sites for structural and construction surveying.
- Can also be applied to any detailed VQA use cases with aerial footage.
- Suitable for disaster response and assessment applications, particularly flood monitoring.

## Training Data
- **Dataset**:
  1. [FloodNet Track 2 dataset](https://huggingface.co/datasets/takara-ai/FloodNet_2021-Track_2_Dataset_HF)
  2. Subset of [FGVC Aircraft dataset](https://huggingface.co/datasets/Multimodal-Fatima/FGVC_Aircraft_train)
  3. Custom dataset of 10 image-caption pairs created using Pixtral

## Training Procedure
- **Training method**: LoRA (Low-Rank Adaptation)
- **Base model**: Ertugrul/Pixtral-12B-Captioner-Relaxed
- **Training hardware**: Nebius-hosted NVIDIA H100 machine

## Usage Example

```python
from transformers import AutoProcessor, AutoModelForCausalLM
import torch
from PIL import Image

# Load model and processor
model_id = "takara-ai/pixtral_aerial_VQA_adapter"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load and process image
image = Image.open("path_to_aerial_image.jpg")
prompt = "Describe the construction progress visible in this aerial image."

# Generate response
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7
)
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

## Citation

```bibtex
@misc{rahnemoonfar2020floodnet,
 title={FloodNet: A High Resolution Aerial Imagery Dataset for Post Flood Scene Understanding},
 author={Maryam Rahnemoonfar and Tashnim Chowdhury and Argho Sarkar and Debvrat Varshney and Masoud Yari and Robin Murphy},
 year={2020},
 eprint={2012.02951},
 archivePrefix={arXiv},
 primaryClass={cs.CV},
 doi={10.48550/arXiv.2012.02951}
}
```

---
For research inquiries and press, please reach out to [email protected]

> 人類を変革する