Qwen2.5-VL-32B-Instruct (Abliterated)
A 32-billion parameter vision-language model with enhanced mathematical reasoning, multimodal understanding, and removed safety guardrails.
Qwen2.5-VL-32B-Instruct-Abliterated is an uncensored variant of the instruction-tuned Qwen2.5-VL-32B model developed by the Qwen team at Alibaba Cloud. The abliteration post-processing, a community modification, removes the model's refusal mechanisms while preserving its advanced vision-language capabilities, so the model produces unrestricted responses without safety filtering or content moderation.
Model Description
Core Capabilities
Vision-Language Understanding:
- Visual Analysis: Comprehends common objects, text, charts, icons, and complex layouts within images
 - Document Parsing: Extracts structured information from invoices, forms, and tables with high accuracy (DocVQA: 94.8)
 - Long Video Comprehension: Processes videos exceeding 1 hour with event capture and temporal understanding
 - Visual Localization: Generates bounding boxes and precise point coordinates for object detection (see the prompt sketch after this list)
 - Agentic Functionality: Performs visual reasoning and provides tool direction for computer and phone interactions
 
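As a hedged illustration of the visual localization capability, a grounding-style prompt can request coordinates directly. This reuses the Transformers setup shown under Usage Examples below; the image filename is a placeholder, and the JSON keys follow Qwen2.5-VL's common grounding convention but may vary with the prompt.
# Hypothetical image file; the model is asked to return bounding boxes as JSON
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "street_scene.jpg"},
            {"type": "text", "text": "Detect every car in this image and output each one as JSON with a bbox_2d field in [x1, y1, x2, y2] pixel coordinates."}
        ]
    }
]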
Enhanced Reasoning:
- Reinforcement learning-trained for superior mathematical problem-solving (MATH: 82.2, MathVista: 74.7)
 - Multi-step complex reasoning across vision and language domains
 - Detailed, well-formatted answers without content restrictions
 - Improved accuracy in visual logic deduction and content recognition
 
Abliteration Features:
- Unrestricted Responses: Safety guardrails and refusal mechanisms removed
 - Uncensored Output: No content filtering or moderation applied
 - Direct Answers: Responds to all queries without ethical hedging or refusals
 - Research/Educational Use: Intended for research, development, and responsible applications
 
Technical Architecture
- Parameters: 33 billion (BF16/F32 precision)
 - Context Length: 32,768 tokens (expandable with YaRN technique)
 - Vision Tokens: 4-16,384 visual tokens (configurable min/max pixels)
 - ViT Enhancements: Window attention, SwiGLU activation, RMSNorm normalization
 - Dynamic Resolution: Adaptive image resolution and video frame rate sampling
 
Repository Contents
Model Files:
- qwen2.5-vl-32b-instruct-abliterated.safetensors - Full precision model weights (63GB, BF16 precision)
- qwen2.5-vl-32b-instruct-abliterated-f16.gguf - GGUF FP16 format (62GB)
- qwen2.5-vl-32b-instruct-abliterated-q5-k-m.gguf - GGUF Q5_K_M quantized (22GB)
- qwen2.5-vl-32b-instruct-abliterated-q4-k-m.gguf - GGUF Q4_K_M quantized (19GB)
Additional Files (download from Hugging Face):
- config.json - Model configuration
- preprocessor_config.json - Image/video preprocessing settings
- tokenizer.json / tokenizer_config.json - Tokenizer files
- generation_config.json - Text generation parameters
- processor_config.json - Unified processor configuration
Total Repository Size: 166GB (all formats combined)
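The configuration and tokenizer files listed above can be fetched with huggingface-cli; the repository id below is a placeholder and should be replaced with the actual Hugging Face repo for this model.
# Download the auxiliary configuration files next to the weights (placeholder repo id)
huggingface-cli download <repo-id> \
    config.json preprocessor_config.json tokenizer.json tokenizer_config.json \
    generation_config.json processor_config.json \
    --local-dir E:/huggingface/qwen2.5-vl-32b-instruct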
Hardware Requirements
Format-Specific Requirements
Safetensors (BF16 - 63GB):
- GPU VRAM: 70-80GB (A100/H100 recommended)
 - System RAM: 32GB minimum
 - Disk Space: 70GB
 - CUDA: Version 11.8 or higher
 - Use Case: Maximum quality, research, fine-tuning
 
GGUF F16 (62GB):
- GPU VRAM: 70-80GB or CPU-only inference
 - System RAM: 64GB minimum (for CPU inference)
 - Disk Space: 65GB
 - Use Case: llama.cpp compatibility, flexible CPU/GPU offloading
 
GGUF Q5_K_M (22GB):
- GPU VRAM: 28-32GB (RTX 4090, A5000, or better)
 - System RAM: 32GB minimum
 - Disk Space: 25GB
 - Use Case: Balanced quality/performance on consumer GPUs
 
GGUF Q4_K_M (19GB):
- GPU VRAM: 24-28GB (RTX 4090, RTX 3090, or better)
 - System RAM: 24GB minimum
 - Disk Space: 22GB
 - Use Case: Maximum accessibility on 24GB consumer GPUs
 
Recommended Configurations
High-Quality Inference (Safetensors/F16):
- 1x NVIDIA A100 80GB or H100 80GB
 - 64GB system RAM
 - NVMe SSD storage for optimal loading times
 - Best for: Research, production with maximum quality
 
Balanced Performance (Q5_K_M):
- 1x NVIDIA RTX 4090 (24GB) or A5000 (24GB)
 - 32GB system RAM
 - SSD storage recommended
 - Best for: Development, prototyping, cost-effective deployment
 
Consumer Hardware (Q4_K_M):
- 1x NVIDIA RTX 4090 (24GB) or RTX 3090 (24GB)
 - 24GB system RAM
 - Standard SSD storage
 - Best for: Experimentation, personal use, resource-constrained scenarios
 
Production Deployment:
- Multi-GPU setup for batched inference (see the sketch after this list)
 - 128GB+ system RAM for concurrent requests
 - High-bandwidth GPU interconnect (NVLink/NVSwitch)
 - Consider Q5_K_M for optimal throughput/quality balance
 
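For the multi-GPU production setup referenced above, a minimal sketch using Transformers with Accelerate-backed weight sharding; the max_memory caps are illustrative assumptions, not measured values.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

# Shard the BF16 weights across two GPUs; anything that does not fit spills to CPU RAM
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "E:/huggingface/qwen2.5-vl-32b-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "75GiB", 1: "75GiB", "cpu": "96GiB"},
)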
Performance Optimization
- Safetensors: Enable Flash Attention 2 for 2-3x speed improvement with transformers
 - GGUF Files: Use llama.cpp or llama-cpp-python for efficient inference
 - GPU Offloading: GGUF formats support partial GPU offloading for hybrid CPU/GPU inference
 - Batch Processing: Optimize throughput with batched requests
 - Production Serving: Consider vLLM (safetensors) or llama.cpp server (GGUF) for deployment
 
Usage Examples
Installation
For Safetensors (Transformers):
# Install from transformers source (recommended)
pip install git+https://github.com/huggingface/transformers accelerate
pip install qwen-vl-utils[decord]==0.0.8
# Additional dependencies
pip install torch torchvision pillow
For GGUF Files (llama.cpp):
# Install llama-cpp-python with GPU support
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
# Or build llama.cpp from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CUDA=1  # For NVIDIA GPUs (legacy Makefile build)
# or
make LLAMA_METAL=1  # For Apple Silicon (legacy Makefile build)
# Note: recent llama.cpp releases have replaced the Makefile with CMake, e.g.
# cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release
Basic Image Understanding (Transformers)
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch
# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "E:/huggingface/qwen2.5-vl-32b-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("E:/huggingface/qwen2.5-vl-32b-instruct")
# Prepare image and text input
image = Image.open("your_image.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]
# Process and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
# Generate response (trim the prompt tokens before decoding)
output_ids = model.generate(**inputs, max_new_tokens=512)
trimmed_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
generated_text = processor.batch_decode(trimmed_ids, skip_special_tokens=True)
print(generated_text[0])
Multi-Image Analysis
# Multiple images in conversation
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "image1.jpg"},
            {"type": "image", "image": "image2.jpg"},
            {"type": "text", "text": "Compare these two images and identify the differences."}
        ]
    }
]
# Process with multiple images
images = [Image.open("image1.jpg"), Image.open("image2.jpg")]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=images, return_tensors="pt")
inputs = inputs.to("cuda")
output_ids = model.generate(**inputs, max_new_tokens=1024)
response = processor.batch_decode(output_ids, skip_special_tokens=True)
print(response[0])
Video Understanding
from qwen_vl_utils import process_vision_info
# Process video input
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "your_video.mp4", "fps": 1.0},
            {"type": "text", "text": "Summarize the main events in this video."}
        ]
    }
]
# Process video with configurable FPS
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt"
)
inputs = inputs.to("cuda")
# Generate video analysis
output_ids = model.generate(**inputs, max_new_tokens=1024)
response = processor.batch_decode(output_ids, skip_special_tokens=True)
print(response[0])
Mathematical Problem Solving
# Enhanced mathematical reasoning
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "math_diagram.png"},
            {"type": "text", "text": "Solve this geometry problem step by step."}
        ]
    }
]
# Process with detailed reasoning
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[Image.open("math_diagram.png")], return_tensors="pt")
inputs = inputs.to("cuda")
# Generate detailed solution
output_ids = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
solution = processor.batch_decode(output_ids, skip_special_tokens=True)
print(solution[0])
Document Parsing (Structured Output)
# Extract structured information from documents
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "invoice.jpg"},
            {"type": "text", "text": "Extract all line items from this invoice in JSON format."}
        ]
    }
]
# Process document
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[Image.open("invoice.jpg")], return_tensors="pt")
inputs = inputs.to("cuda")
# Generate structured output
output_ids = model.generate(**inputs, max_new_tokens=1024, temperature=0.1)
structured_data = processor.batch_decode(output_ids, skip_special_tokens=True)
print(structured_data[0])
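Because the model returns plain text, downstream code should parse the JSON defensively; a minimal sketch:
import json
import re

# Extract the first JSON object or array from the generated text; the model may
# wrap it in prose or markdown fences, so fall back to None if nothing parses
match = re.search(r"\{.*\}|\[.*\]", structured_data[0], re.DOTALL)
try:
    line_items = json.loads(match.group(0)) if match else None
except json.JSONDecodeError:
    line_items = None
print(line_items)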
Custom Resolution Configuration
# Adjust the visual token budget used by the image processor
processor.image_processor.min_pixels = 256 * 256    # Minimum resolution (pixel budget)
processor.image_processor.max_pixels = 2048 * 2048  # Maximum resolution (pixel budget)
# For an exact target size, request it per image in the message content;
# qwen_vl_utils.process_vision_info applies resized_height/resized_width when preparing inputs
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "your_image.jpg",
             "resized_height": 1024, "resized_width": 1024},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, _ = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, return_tensors="pt")
Batch Processing
# Efficient batch inference
batch_messages = [
    [{"role": "user", "content": [{"type": "image", "image": f"img{i}.jpg"},
                                   {"type": "text", "text": "Describe this image."}]}]
    for i in range(4)
]
# Process batch
texts = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
         for msg in batch_messages]
images_batch = [Image.open(f"img{i}.jpg") for i in range(4)]
inputs = processor(text=texts, images=images_batch, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
# Batch generation
output_ids = model.generate(**inputs, max_new_tokens=512)
responses = processor.batch_decode(output_ids, skip_special_tokens=True)
for i, resp in enumerate(responses):
    print(f"Image {i}: {resp}")
GGUF Inference (llama.cpp)
Using llama-cpp-python:
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Qwen2VLChatHandler
# Initialize model with vision support.
# Note: the chat-handler class name and whether a separate mmproj (clip_model_path)
# file must be passed depend on your llama-cpp-python version; verify Qwen2.5-VL
# support in its documentation before relying on this code path.
chat_handler = Qwen2VLChatHandler()
llm = Llama(
    model_path="E:/huggingface/qwen2.5-vl-32b-instruct/qwen2.5-vl-32b-instruct-abliterated-q4-k-m.gguf",
    chat_handler=chat_handler,
    n_ctx=32768,  # Context window
    n_gpu_layers=50,  # Offload layers to GPU (-1 for all)
    n_batch=512,
    verbose=False
)
# Vision-language inference
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "file:///path/to/image.jpg"}},
                {"type": "text", "text": "Describe this image in detail."}
            ]
        }
    ],
    max_tokens=512,
    temperature=0.7
)
print(response['choices'][0]['message']['content'])
Using llama.cpp CLI:
# GPU-accelerated inference with Q4_K_M
# Note: vision input in llama.cpp requires a multimodal-capable binary and a companion
# mmproj GGUF file; depending on your llama.cpp version this may be llama-mtmd-cli with
# --mmproj rather than llama-cli with --image. Check your build's documentation.
./llama-cli \
    -m E:/huggingface/qwen2.5-vl-32b-instruct/qwen2.5-vl-32b-instruct-abliterated-q4-k-m.gguf \
    --image your_image.jpg \
    -p "Describe this image in detail." \
    -n 512 \
    -ngl 50 \
    --ctx-size 32768
# CPU-only inference with Q5_K_M
./llama-cli \
    -m E:/huggingface/qwen2.5-vl-32b-instruct/qwen2.5-vl-32b-instruct-abliterated-q5-k-m.gguf \
    --image your_image.jpg \
    -p "What objects are in this image?" \
    -n 256 \
    -t 16 \
    --ctx-size 8192
# Server mode for API access
./llama-server \
    -m E:/huggingface/qwen2.5-vl-32b-instruct/qwen2.5-vl-32b-instruct-abliterated-q4-k-m.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 50 \
    --ctx-size 32768
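Once llama-server is running, it exposes an OpenAI-compatible /v1/chat/completions endpoint; a minimal text-only request is shown below (whether image inputs work over this API depends on the llama.cpp build and on loading the companion mmproj file).
# Query the OpenAI-compatible chat endpoint served by llama-server
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "List three things a vision-language model can do."}], "max_tokens": 128}'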
GPU Layer Offloading Guide (GGUF):
| Model File | Full GPU (ngl) | 24GB GPU (ngl) | 16GB GPU (ngl) | CPU Only (ngl) | 
|---|---|---|---|---|
| F16 (62GB) | -1 (all) | ~20 layers | ~10 layers | 0 | 
| Q5_K_M (22GB) | -1 (all) | ~40 layers | ~25 layers | 0 | 
| Q4_K_M (19GB) | -1 (all) | 50+ layers | ~35 layers | 0 | 
Performance Comparison (see the benchmark command after this list):
- Q4_K_M: ~95% quality of F16, 3-4x faster inference on consumer GPUs
 - Q5_K_M: ~98% quality of F16, 2-3x faster inference
 - F16 GGUF: Same quality as safetensors, flexible CPU/GPU offloading
 
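To verify these trade-offs on your own hardware, llama.cpp ships a llama-bench tool; run one invocation per quantization, pointing -m at the file you downloaded.
# Measure prompt-processing (-p) and generation (-n) throughput with full GPU offload
./llama-bench \
    -m E:/huggingface/qwen2.5-vl-32b-instruct/qwen2.5-vl-32b-instruct-abliterated-q4-k-m.gguf \
    -ngl 99 -p 512 -n 128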
Model Specifications
Architecture Details
- Model Type: Vision-Language Transformer
 - Base Architecture: Qwen2.5 language model + Enhanced ViT vision encoder
 - Vision Encoder: Window attention with SwiGLU and RMSNorm
 - Parameters: 32.7 billion (text) + 675 million (vision encoder)
 
Available Formats
Safetensors (63GB):
- Precision: BF16 (Brain Float 16)
 - Framework: Hugging Face Transformers (PyTorch)
 - Use Case: Maximum quality, fine-tuning, research
 - Compatibility: PyTorch, Hugging Face ecosystem
 
GGUF Formats:
- F16 (62GB): Full FP16 precision, identical quality to safetensors
 - Q5_K_M (22GB): 5-bit quantization with medium mix, ~98% quality retention
 - Q4_K_M (19GB): 4-bit quantization with medium mix, ~95% quality retention
 - Framework: llama.cpp, llama-cpp-python, LM Studio, Ollama
 - Use Case: CPU/GPU hybrid inference, consumer hardware deployment
 - Compatibility: Cross-platform (Windows, Linux, macOS, ARM)
 
Training Enhancements
- Reinforcement Learning: Enhanced mathematical and problem-solving capabilities
 - Dynamic Resolution Training: Adaptive image resolution and video FPS sampling
 - Human Preference Alignment: Improved response formatting and detail
 - Temporal Understanding: Extended dynamic resolution to video temporal dimension
 
Supported Input Formats
- Images: JPEG, PNG, WebP, local paths, URLs, base64-encoded (see the example after this list)
 - Videos: MP4, AVI, configurable FPS (0.5-30 fps typical)
 - Text: UTF-8 encoded, 32K context window
 - Structured Data: Tables, forms, invoices with JSON output capability
 
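As noted above, qwen-vl-utils resolves local paths, HTTP(S) URLs, and base64 data URIs from the image field before they reach the processor; a hedged sketch (the URL is a placeholder):
from qwen_vl_utils import process_vision_info

# Any of these "image" values is accepted; process_vision_info fetches and decodes it
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/photo.jpg"},      # URL (placeholder)
            # {"type": "image", "image": "file:///path/to/photo.jpg"},        # local file path
            # {"type": "image", "image": "data:image;base64,<BASE64_DATA>"},  # base64-encoded
            {"type": "text", "text": "Describe this image."}
        ]
    }
]
image_inputs, video_inputs = process_vision_info(messages)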
Benchmark Performance
| Benchmark | Score | Category | 
|---|---|---|
| MMMU | 70.0 | Multimodal Understanding | 
| MMMU-Pro | - | Advanced Reasoning | 
| MathVista | 74.7 | Mathematical Reasoning | 
| DocVQA | 94.8 | Document Understanding | 
| Android Control | 69.6/93.3 | Agentic Interaction | 
| MMLU | 78.4 | Language Understanding | 
| MATH | 82.2 | Mathematical Problem Solving | 
| HumanEval | 91.5 | Code Generation | 
Performance Tips and Optimization
Inference Optimization
Flash Attention 2 (Recommended):
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "E:/huggingface/qwen2.5-vl-32b-instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # 2-3x faster
    device_map="auto"
)
Quantization for Memory Efficiency:
# INT8 quantization (requires bitsandbytes)
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "E:/huggingface/qwen2.5-vl-32b-instruct",
    quantization_config=quantization_config,
    device_map="auto"
)
Extended Context (YaRN):
# For sequences > 32K tokens, enable YaRN rope scaling.
# For Qwen2.5-VL the official guidance is to set this in config.json before loading
# the model (the released config also carries an mrope_section entry), and static
# YaRN can degrade temporal/spatial localization on long inputs.
model.config.rope_scaling = {
    "type": "yarn",
    "factor": 4.0,  # Extend context toward 128K tokens
    "original_max_position_embeddings": 32768
}
Best Practices
- Image Resolution: Use 512-1024px for standard images, up to 2048px for detailed document parsing
 - Video Processing: Adjust FPS based on content (1-2 fps for static scenes, 5-10 fps for action)
 - Batch Size: Start with batch_size=1-2 for 80GB VRAM, scale based on sequence length
 - Temperature: Use 0.1-0.3 for factual tasks, 0.7-0.9 for creative generation
 - Max Tokens: Allocate 512 tokens for descriptions, 2048+ for detailed analysis or math
 
Memory Management
# Clear CUDA cache between runs
import torch
torch.cuda.empty_cache()
# Gradient checkpointing for fine-tuning
model.gradient_checkpointing_enable()
# CPU offloading for large batches
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "E:/huggingface/qwen2.5-vl-32b-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    offload_folder="offload",
    offload_state_dict=True
)
Production Deployment
Using vLLM (High Throughput):
pip install vllm
python -m vllm.entrypoints.openai.api_server \
    --model E:/huggingface/qwen2.5-vl-32b-instruct \
    --dtype bfloat16 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9
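The vLLM server speaks the OpenAI chat API; a hedged request passing an image by URL (the exact multimodal payload shape can differ between vLLM versions, and the image URL is a placeholder):
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "E:/huggingface/qwen2.5-vl-32b-instruct",
        "messages": [{"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            {"type": "text", "text": "Describe this image."}
        ]}],
        "max_tokens": 256
    }'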
Using Text Generation Inference:
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v E:/huggingface/qwen2.5-vl-32b-instruct:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id /data --dtype bfloat16
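A quick text-only smoke test against TGI's standard /generate endpoint; whether this container version supports Qwen2.5-VL's vision inputs should be verified against the TGI release notes.
curl http://localhost:8080/generate \
    -H "Content-Type: application/json" \
    -d '{"inputs": "Briefly explain what a vision-language model does.", "parameters": {"max_new_tokens": 128}}'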
Fine-tuning
The model supports fine-tuning using standard Hugging Face training workflows:
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir="./qwen-vl-finetuned",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    num_train_epochs=3,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch"
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator
)
trainer.train()
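The train_dataset and data_collator above are placeholders. A minimal collator sketch follows, assuming each example is a dict with a chat-format "messages" list and a PIL "image" and reusing the processor loaded earlier; define it before constructing the Trainer, and note that prompt-token masking is omitted and should follow your training recipe.
def vl_data_collator(examples):
    # Render each conversation to text and batch it with its image through the processor
    texts = [processor.apply_chat_template(ex["messages"], tokenize=False) for ex in examples]
    images = [ex["image"] for ex in examples]
    batch = processor(text=texts, images=images, padding=True, return_tensors="pt")
    # Train on all non-padding tokens; refine the mask for instruction tuning as needed
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100
    batch["labels"] = labels
    return batch

data_collator = vl_data_collator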
Available Variants
- Quantized Models: 25 quantized versions (INT8, INT4, GPTQ, AWQ)
 - Fine-tuned Adapters: 5 LoRA/QLoRA adapter models
 - Specialized Fine-tunes: 47 community fine-tuned variants for specific domains
 
License
This model is released under the Apache License 2.0 (base model license).
Copyright 2025 Alibaba Cloud (base model)
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Terms of Use
- ✅ Commercial use allowed
 - ✅ Modification and distribution permitted
 - ✅ Private and public use
 - ⚠️ Must include original license and copyright notice
 - ⚠️ Provided "as-is" without warranty
 
Important Notice - Abliterated Model
⚠️ This is an uncensored, abliterated variant with removed safety mechanisms:
- User Responsibility: Users are solely responsible for appropriate use and ethical considerations
 - No Built-in Safety: This model does not include content filtering or safety guardrails
 - Intended Use: Research, development, and responsible applications with proper oversight
 - Not Endorsed: This abliterated variant is not officially endorsed or supported by Alibaba Cloud
 - Legal Compliance: Users must ensure compliance with all applicable laws and regulations
 
Citation
If you use Qwen2.5-VL-32B-Instruct in your research or applications, please cite:
@article{qwen2.5-vl,
  title={Qwen2.5-VL Technical Report},
  author={Bai, Shuai and others},
  journal={arXiv preprint arXiv:2502.13923},
  year={2025},
  url={https://arxiv.org/abs/2502.13923}
}
Contact and Resources
Official Resources
- Model Card: Hugging Face - Qwen/Qwen2.5-VL-32B-Instruct
 - GitHub Repository: QwenLM/Qwen2.5-VL
 - Technical Report: arXiv:2502.13923
 - Official Blog: Qwen2.5-VL-32B Announcement
 
Community and Support
- Hugging Face Spaces: 86 community demos and applications
 - Discussions: Hugging Face Community
 - Issues: Report bugs on the GitHub repository
 
Related Models
- Qwen2.5-VL-7B-Instruct: Smaller variant for resource-constrained environments
 - Qwen2.5-VL-72B-Instruct: Larger variant with enhanced capabilities
 - Qwen2-VL-72B-Instruct: Previous generation model
 - Qwen2.5-VL-32B-Instruct: Original censored version with safety guardrails
 
- Base Model: Qwen Team, Alibaba Cloud
- Abliteration: Community modification (uncensored variant)
- Base Release Date: March 25, 2025
- Model Version: 2.5 (abliterated)
- README Version: v1.1