Update README.md

fe5b940 verified 8 months ago

3.62 kB

	---
	license: mit
	base_model:
	- mistralai/Pixtral-12B-2409
	pipeline_tag: image-text-to-text
	library_name: transformers
	tags:
	- lora
	datasets:
	- Multimodal-Fatima/FGVC_Aircraft_train
	- takara-ai/FloodNet_2021-Track_2_Dataset_HF
	---

	<img src="https://takara.ai/images/logo-24/TakaraAi.svg" width="200" alt="Takara.ai Logo" />
	From the Frontier Research Team at Takara.ai we present a specialized LoRA adapter for aerial imagery analysis and visual question answering.

	---

	# pixtral_aerial_VQA_adapter

	## Overview
	This repository contains a fine-tuned LoRA adapter for the Pixtral-12B model, optimized specifically for aerial imagery analysis and visual question answering. The adapter enables detailed processing of aerial footage with a focus on construction site surveying, structural assessment, and environmental monitoring.

	## Model Details
	- Type: LoRA Adapter
	- Total Parameters: 6,225,920
	- Memory Usage: 23.75 MB
	- Precisions: torch.float32
	- Layer Types:
	- lora_A: 40
	- lora_B: 40
	- Base Model: [mistralai/Pixtral-12B-2409](https://huggingface.co/mistralai/Pixtral-12B-2409)

	## Capabilities
	The adapter enhances Pixtral's ability to:
	- Identify and describe construction elements in aerial imagery
	- Detect structural issues in buildings and infrastructure
	- Analyze progress in construction projects
	- Monitor environmental changes and flooding events
	- Process high-resolution aerial imagery with improved detail recognition

	## Intended Use
	- Primary intended uses: Processing aerial footage of construction sites for structural and construction surveying.
	- Can also be applied to any detailed VQA use cases with aerial footage.
	- Suitable for disaster response and assessment applications, particularly flood monitoring.

	## Training Data
	- Dataset:
	1. [FloodNet Track 2 dataset](https://huggingface.co/datasets/takara-ai/FloodNet_2021-Track_2_Dataset_HF)
	2. Subset of [FGVC Aircraft dataset](https://huggingface.co/datasets/Multimodal-Fatima/FGVC_Aircraft_train)
	3. Custom dataset of 10 image-caption pairs created using Pixtral

	## Training Procedure
	- Training method: LoRA (Low-Rank Adaptation)
	- Base model: Ertugrul/Pixtral-12B-Captioner-Relaxed
	- Training hardware: Nebius-hosted NVIDIA H100 machine

	## Usage Example

	```python
	from transformers import AutoProcessor, AutoModelForCausalLM
	import torch
	from PIL import Image

	# Load model and processor
	model_id = "takara-ai/pixtral_aerial_VQA_adapter"
	processor = AutoProcessor.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(
	model_id,
	torch_dtype=torch.float16,
	device_map="auto"
	)

	# Load and process image
	image = Image.open("path_to_aerial_image.jpg")
	prompt = "Describe the construction progress visible in this aerial image."

	# Generate response
	inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
	generated_ids = model.generate(
	**inputs,
	max_new_tokens=512,
	do_sample=True,
	temperature=0.7
	)
	response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
	print(response)
	```

	## Citation

	```bibtex
	@misc{rahnemoonfar2020floodnet,
	title={FloodNet: A High Resolution Aerial Imagery Dataset for Post Flood Scene Understanding},
	author={Maryam Rahnemoonfar and Tashnim Chowdhury and Argho Sarkar and Debvrat Varshney and Masoud Yari and Robin Murphy},
	year={2020},
	eprint={2012.02951},
	archivePrefix={arXiv},
	primaryClass={cs.CV},
	doi={10.48550/arXiv.2012.02951}
	}
	```

	---
	For research inquiries and press, please reach out to [email protected]

	> 人類を変革する