Qwen2.5-VL-32B-Instruct (Abliterated)
A 32-billion parameter vision-language model with enhanced mathematical reasoning, multimodal understanding, and removed safety guardrails.
Qwen2.5-VL-32B-Instruct-Abliterated is an uncensored variant of the instruction-tuned Qwen2.5-VL-32B model developed by the Qwen team at Alibaba Cloud. The abliteration post-processing, a community modification, removes the model's refusal mechanisms while preserving its advanced vision-language capabilities, so the model produces unrestricted responses without safety filtering or content moderation.
Model Description
Core Capabilities
Vision-Language Understanding:
- Visual Analysis: Comprehends common objects, text, charts, icons, and complex layouts within images
 - Document Parsing: Extracts structured information from invoices, forms, and tables with high accuracy (DocVQA: 94.8)
 - Long Video Comprehension: Processes videos exceeding 1 hour with event capture and temporal understanding
 - Visual Localization: Generates bounding boxes and precise point coordinates for object detection (see the prompt sketch after this list)
 - Agentic Functionality: Performs visual reasoning and provides tool direction for computer and phone interactions
 
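As a hedged illustration of the visual localization capability, a grounding-style prompt can request coordinates directly. This reuses the Transformers setup shown under Usage Examples below; the image filename is a placeholder, and the JSON keys follow Qwen2.5-VL's common grounding convention but may vary with the prompt.
# Hypothetical image file; the model is asked to return bounding boxes as JSON
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "street_scene.jpg"},
            {"type": "text", "text": "Detect every car in this image and output each one as JSON with a bbox_2d field in [x1, y1, x2, y2] pixel coordinates."}
        ]
    }
]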
Enhanced Reasoning:
- Reinforcement learning-trained for superior mathematical problem-solving (MATH: 82.2, MathVista: 74.7)
 - Multi-step complex reasoning across vision and language domains
 - Detailed, well-formatted answers without content restrictions
 - Improved accuracy in visual logic deduction and content recognition
 
Abliteration Features:
- Unrestricted Responses: Safety guardrails and refusal mechanisms removed
 - Uncensored Output: No content filtering or moderation applied
 - Direct Answers: Responds to all queries without ethical hedging or refusals
 - Research/Educational Use: Intended for research, development, and responsible applications
 
Technical Architecture
- Parameters: 33 billion (BF16/F32 precision)
 - Context Length: 32,768 tokens (expandable with YaRN technique)
 - Vision Tokens: 4-16,384 visual tokens (configurable min/max pixels)
 - ViT Enhancements: Window attention, SwiGLU activation, RMSNorm normalization
 - Dynamic Resolution: Adaptive image resolution and video frame rate sampling
 
Repository Contents
Model Files:
- qwen2.5-vl-32b-instruct-abliterated.safetensors - Full precision model weights (63GB, BF16 precision)
- qwen2.5-vl-32b-instruct-abliterated-f16.gguf - GGUF FP16 format (62GB)
- qwen2.5-vl-32b-instruct-abliterated-q5-k-m.gguf - GGUF Q5_K_M quantized (22GB)
- qwen2.5-vl-32b-instruct-abliterated-q4-k-m.gguf - GGUF Q4_K_M quantized (19GB)
Additional Files (download from Hugging Face):
- config.json - Model configuration
- preprocessor_config.json - Image/video preprocessing settings
- tokenizer.json / tokenizer_config.json - Tokenizer files
- generation_config.json - Text generation parameters
- processor_config.json - Unified processor configuration
Total Repository Size: 166GB (all formats combined)
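The configuration and tokenizer files listed above can be fetched with huggingface-cli; the repository id below is a placeholder and should be replaced with the actual Hugging Face repo for this model.
# Download the auxiliary configuration files next to the weights (placeholder repo id)
huggingface-cli download <repo-id> \
    config.json preprocessor_config.json tokenizer.json tokenizer_config.json \
    generation_config.json processor_config.json \
    --local-dir E:/huggingface/qwen2.5-vl-32b-instruct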
Hardware Requirements
Format-Specific Requirements
Safetensors (BF16 - 63GB):
- GPU VRAM: 70-80GB (A100/H100 recommended)
 - System RAM: 32GB minimum
 - Disk Space: 70GB
 - CUDA: Version 11.8 or higher
 - Use Case: Maximum quality, research, fine-tuning
 
GGUF F16 (62GB):
- GPU VRAM: 70-80GB or CPU-only inference
 - System RAM: 64GB minimum (for CPU inference)
 - Disk Space: 65GB
 - Use Case: llama.cpp compatibility, flexible CPU/GPU offloading
 
GGUF Q5_K_M (22GB):
- GPU VRAM: 28-32GB (RTX 4090, A5000, or better)
 - System RAM: 32GB minimum
 - Disk Space: 25GB
 - Use Case: Balanced quality/performance on consumer GPUs
 
GGUF Q4_K_M (19GB):
- GPU VRAM: 24-28GB (RTX 4090, RTX 3090, or better)
 - System RAM: 24GB minimum
 - Disk Space: 22GB
 - Use Case: Maximum accessibility on 24GB consumer GPUs
 
Recommended Configurations
High-Quality Inference (Safetensors/F16):
- 1x NVIDIA A100 80GB or H100 80GB
 - 64GB system RAM
 - NVMe SSD storage for optimal loading times
 - Best for: Research, production with maximum quality
 
Balanced Performance (Q5_K_M):
- 1x NVIDIA RTX 4090 (24GB) or A5000 (24GB)
 - 32GB system RAM
 - SSD storage recommended
 - Best for: Development, prototyping, cost-effective deployment
 
Consumer Hardware (Q4_K_M):
- 1x NVIDIA RTX 4090 (24GB) or RTX 3090 (24GB)
 - 24GB system RAM
 - Standard SSD storage
 - Best for: Experimentation, personal use, resource-constrained scenarios
 
Production Deployment:
- Multi-GPU setup for batched inference (see the sketch after this list)
 - 128GB+ system RAM for concurrent requests
 - High-bandwidth GPU interconnect (NVLink/NVSwitch)
 - Consider Q5_K_M for optimal throughput/quality balance
 
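For the multi-GPU production setup referenced above, a minimal sketch using Transformers with Accelerate-backed weight sharding; the max_memory caps are illustrative assumptions, not measured values.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

# Shard the BF16 weights across two GPUs; anything that does not fit spills to CPU RAM
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "E:/huggingface/qwen2.5-vl-32b-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "75GiB", 1: "75GiB", "cpu": "96GiB"},
)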
Performance Optimization
- Safetensors: Enable Flash Attention 2 for 2-3x speed improvement with transformers
 - GGUF Files: Use llama.cpp or llama-cpp-python for efficient inference
 - GPU Offloading: GGUF formats support partial GPU offloading for hybrid CPU/GPU inference
 - Batch Processing: Optimize throughput with batched requests
 - Production Serving: Consider vLLM (safetensors) or llama.cpp server (GGUF) for deployment
 
Usage Examples
Installation
For Safetensors (Transformers):
# Install from transformers source (recommended)
pip install git+https://github.com/huggingface/transformers accelerate
pip install qwen-vl-utils[decord]==0.0.8
# Additional dependencies
pip install torch torchvision pillow
For GGUF Files (llama.cpp):
# Install llama-cpp-python with GPU support
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
# Or build llama.cpp from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CUDA=1  # For NVIDIA GPUs (legacy Makefile build)
# or
make LLAMA_METAL=1  # For Apple Silicon (legacy Makefile build)
# Note: recent llama.cpp releases have replaced the Makefile with CMake, e.g.
# cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release
Basic Image Understanding (Transformers)
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch
# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "E:/huggingface/qwen2.5-vl-32b-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("E:/huggingface/qwen2.5-vl-32b-instruct")
# Prepare image and text input
image = Image.open("your_image.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]
# Process and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
# Generate response (trim the prompt tokens before decoding)
output_ids = model.generate(**inputs, max_new_tokens=512)
trimmed_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
generated_text = processor.batch_decode(trimmed_ids, skip_special_tokens=True)
print(generated_text[0])
Multi-Image Analysis
# Multiple images in conversation
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "image1.jpg"},
            {"type": "image", "image": "image2.jpg"},
            {"type": "text", "text": "Compare these two images and identify the differences."}
        ]
    }
]
# Process with multiple images
images = [Image.open("image1.jpg"), Image.open("image2.jpg")]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=images, return_tensors="pt")
inputs = inputs.to("cuda")
output_ids = model.generate(**inputs, max_new_tokens=1024)
response = processor.batch_decode(output_ids, skip_special_tokens=True)
print(response[0])
Video Understanding
from qwen_vl_utils import process_vision_info
# Process video input
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "your_video.mp4", "fps": 1.0},
            {"type": "text", "text": "Summarize the main events in this video."}
        ]
    }
]
# Process video with configurable FPS
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt"
)
inputs = inputs.to("cuda")
# Generate video analysis
output_ids = model.generate(**inputs, max_new_tokens=1024)
response = processor.batch_decode(output_ids, skip_special_tokens=True)
print(response[0])
Mathematical Problem Solving
# Enhanced mathematical reasoning
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "math_diagram.png"},
            {"type": "text", "text": "Solve this geometry problem step by step."}
        ]
    }
]
# Process with detailed reasoning
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[Image.open("math_diagram.png")], return_tensors="pt")
inputs = inputs.to("cuda")
# Generate detailed solution
output_ids = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
solution = processor.batch_decode(output_ids, skip_special_tokens=True)
print(solution[0])
Document Parsing (Structured Output)
# Extract structured information from documents
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "invoice.jpg"},
            {"type": "text", "text": "Extract all line items from this invoice in JSON format."}
        ]
    }
]
# Process document
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[Image.open("invoice.jpg")], return_tensors="pt")
inputs = inputs.to("cuda")
# Generate structured output
output_ids = model.generate(**inputs, max_new_tokens=1024, temperature=0.1)
structured_data = processor.batch_decode(output_ids, skip_special_tokens=True)
print(structured_data[0])
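Because the model returns plain text, downstream code should parse the JSON defensively; a minimal sketch:
import json
import re

# Extract the first JSON object or array from the generated text; the model may
# wrap it in prose or markdown fences, so fall back to None if nothing parses
match = re.search(r"\{.*\}|\[.*\]", structured_data[0], re.DOTALL)
try:
    line_items = json.loads(match.group(0)) if match else None
except json.JSONDecodeError:
    line_items = None
print(line_items)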
Custom Resolution Configuration
# Adjust the visual token budget used by the image processor
processor.image_processor.min_pixels = 256 * 256    # Minimum resolution (pixel budget)
processor.image_processor.max_pixels = 2048 * 2048  # Maximum resolution (pixel budget)
# For an exact target size, request it per image in the message content;
# qwen_vl_utils.process_vision_info applies resized_height/resized_width when preparing inputs
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "your_image.jpg",
             "resized_height": 1024, "resized_width": 1024},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, _ = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, return_tensors="pt")
Batch Processing
# Efficient batch inference
batch_messages = [
    [{"role": "user", "content": [{"type": "image", "image": f"img{i}.jpg"},
                                   {"type": "text", "text": "Describe this image."}]}]
    for i in range(4)
]
# Process batch
texts = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
         for msg in batch_messages]
images_batch = [Image.open(f"img{i}.jpg") for i in range(4)]
inputs = processor(text=texts, images=images_batch, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
# Batch generation
output_ids = model.generate(**inputs, max_new_tokens=512)
responses = processor.batch_decode(output_ids, skip_special_tokens=True)
for i, resp in enumerate(responses):
    print(f"Image {i}: {resp}")
GGUF Inference (llama.cpp)
Using llama-cpp-python:
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Qwen2VLChatHandler
# Initialize model with vision support.
# Note: the chat-handler class name and whether a separate mmproj (clip_model_path)
# file must be passed depend on your llama-cpp-python version; verify Qwen2.5-VL
# support in its documentation before relying on this code path.
chat_handler = Qwen2VLChatHandler()
llm = Llama(
    model_path="E:/huggingface/qwen2.5-vl-32b-instruct/qwen2.5-vl-32b-instruct-abliterated-q4-k-m.gguf",
    chat_handler=chat_handler,
    n_ctx=32768,  # Context window
    n_gpu_layers=50,  # Offload layers to GPU (-1 for all)
    n_batch=512,
    verbose=False
)
# Vision-language inference
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "file:///path/to/image.jpg"}},
                {"type": "text", "text": "Describe this image in detail."}
            ]
        }
    ],
    max_tokens=512,
    temperature=0.7
)
print(response['choices'][0]['message']['content'])
Using llama.cpp CLI:
# GPU-accelerated inference with Q4_K_M
# Note: vision input in llama.cpp requires a multimodal-capable binary and a companion
# mmproj GGUF file; depending on your llama.cpp version this may be llama-mtmd-cli with
# --mmproj rather than llama-cli with --image. Check your build's documentation.
./llama-cli \
    -m E:/huggingface/qwen2.5-vl-32b-instruct/qwen2.5-vl-32b-instruct-abliterated-q4-k-m.gguf \
    --image your_image.jpg \
    -p "Describe this image in detail." \
    -n 512 \
    -ngl 50 \
    --ctx-size 32768
# CPU-only inference with Q5_K_M
./llama-cli \
    -m E:/huggingface/qwen2.5-vl-32b-instruct/qwen2.5-vl-32b-instruct-abliterated-q5-k-m.gguf \
    --image your_image.jpg \
    -p "What objects are in this image?" \
    -n 256 \
    -t 16 \
    --ctx-size 8192
# Server mode for API access
./llama-server \
    -m E:/huggingface/qwen2.5-vl-32b-instruct/qwen2.5-vl-32b-instruct-abliterated-q4-k-m.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 50 \
    --ctx-size 32768
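Once llama-server is running, it exposes an OpenAI-compatible /v1/chat/completions endpoint; a minimal text-only request is shown below (whether image inputs work over this API depends on the llama.cpp build and on loading the companion mmproj file).
# Query the OpenAI-compatible chat endpoint served by llama-server
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "List three things a vision-language model can do."}], "max_tokens": 128}'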
GPU Layer Offloading Guide (GGUF):
| Model File | Full GPU (ngl) | 24GB GPU (ngl) | 16GB GPU (ngl) | CPU Only (ngl) | 
|---|---|---|---|---|
| F16 (62GB) | -1 (all) | ~20 layers | ~10 layers | 0 | 
| Q5_K_M (22GB) | -1 (all) | ~40 layers | ~25 layers | 0 | 
| Q4_K_M (19GB) | -1 (all) | 50+ layers | ~35 layers | 0 | 
Performance Comparison (see the benchmark command after this list):
- Q4_K_M: ~95% quality of F16, 3-4x faster inference on consumer GPUs
 - Q5_K_M: ~98% quality of F16, 2-3x faster inference
 - F16 GGUF: Same quality as safetensors, flexible CPU/GPU offloading
 
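To verify these trade-offs on your own hardware, llama.cpp ships a llama-bench tool; run one invocation per quantization, pointing -m at the file you downloaded.
# Measure prompt-processing (-p) and generation (-n) throughput with full GPU offload
./llama-bench \
    -m E:/huggingface/qwen2.5-vl-32b-instruct/qwen2.5-vl-32b-instruct-abliterated-q4-k-m.gguf \
    -ngl 99 -p 512 -n 128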
Model Specifications
Architecture Details
- Model Type: Vision-Language Transformer
 - Base Architecture: Qwen2.5 language model + Enhanced ViT vision encoder
 - Vision Encoder: Window attention with SwiGLU and RMSNorm
 - Parameters: 32.7 billion (text) + 675 million (vision encoder)
 
Available Formats
Safetensors (63GB):
- Precision: BF16 (Brain Float 16)
 - Framework: Hugging Face Transformers (PyTorch)
 - Use Case: Maximum quality, fine-tuning, research
 - Compatibility: PyTorch, Hugging Face ecosystem
 
GGUF Formats:
- F16 (62GB): Full FP16 precision, identical quality to safetensors
 - Q5_K_M (22GB): 5-bit quantization with medium mix, ~98% quality retention
 - Q4_K_M (19GB): 4-bit quantization with medium mix, ~95% quality retention
 - Framework: llama.cpp, llama-cpp-python, LM Studio, Ollama
 - Use Case: CPU/GPU hybrid inference, consumer hardware deployment
 - Compatibility: Cross-platform (Windows, Linux, macOS, ARM)
 
Training Enhancements
- Reinforcement Learning: Enhanced mathematical and problem-solving capabilities
 - Dynamic Resolution Training: Adaptive image resolution and video FPS sampling
 - Human Preference Alignment: Improved response formatting and detail
 - Temporal Understanding: Extended dynamic resolution to video temporal dimension
 
Supported Input Formats
- Images: JPEG, PNG, WebP, local paths, URLs, base64-encoded (see the example after this list)
 - Videos: MP4, AVI, configurable FPS (0.5-30 fps typical)
 - Text: UTF-8 encoded, 32K context window
 - Structured Data: Tables, forms, invoices with JSON output capability
 
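As noted above, qwen-vl-utils resolves local paths, HTTP(S) URLs, and base64 data URIs from the image field before they reach the processor; a hedged sketch (the URL is a placeholder):
from qwen_vl_utils import process_vision_info

# Any of these "image" values is accepted; process_vision_info fetches and decodes it
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/photo.jpg"},      # URL (placeholder)
            # {"type": "image", "image": "file:///path/to/photo.jpg"},        # local file path
            # {"type": "image", "image": "data:image;base64,<BASE64_DATA>"},  # base64-encoded
            {"type": "text", "text": "Describe this image."}
        ]
    }
]
image_inputs, video_inputs = process_vision_info(messages)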
Benchmark Performance
| Benchmark | Score | Category | 
|---|---|---|
| MMMU | 70.0 | Multimodal Understanding | 
| MMMU-Pro | - | Advanced Reasoning | 
| MathVista | 74.7 | Mathematical Reasoning | 
| DocVQA | 94.8 | Document Understanding | 
| Android Control | 69.6/93.3 | Agentic Interaction | 
| MMLU | 78.4 | Language Understanding | 
| MATH | 82.2 | Mathematical Problem Solving | 
| HumanEval | 91.5 | Code Generation | 
Performance Tips and Optimization
Inference Optimization
Flash Attention 2 (Recommended):
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "E:/huggingface/qwen2.5-vl-32b-instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # 2-3x faster
    device_map="auto"
)
Quantization for Memory Efficiency:
# INT8 quantization (requires bitsandbytes)
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "E:/huggingface/qwen2.5-vl-32b-instruct",
    quantization_config=quantization_config,
    device_map="auto"
)
Extended Context (YaRN):
# For sequences > 32K tokens, enable YaRN rope scaling.
# For Qwen2.5-VL the official guidance is to set this in config.json before loading
# the model (the released config also carries an mrope_section entry), and static
# YaRN can degrade temporal/spatial localization on long inputs.
model.config.rope_scaling = {
    "type": "yarn",
    "factor": 4.0,  # Extend context toward 128K tokens
    "original_max_position_embeddings": 32768
}
Best Practices
- Image Resolution: Use 512-1024px for standard images, up to 2048px for detailed document parsing
 - Video Processing: Adjust FPS based on content (1-2 fps for static scenes, 5-10 fps for action)
 - Batch Size: Start with batch_size=1-2 for 80GB VRAM, scale based on sequence length
 - Temperature: Use 0.1-0.3 for factual tasks, 0.7-0.9 for creative generation
 - Max Tokens: Allocate 512 tokens for descriptions, 2048+ for detailed analysis or math
 
Memory Management
# Clear CUDA cache between runs
import torch
torch.cuda.empty_cache()
# Gradient checkpointing for fine-tuning
model.gradient_checkpointing_enable()
# CPU offloading for large batches
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "E:/huggingface/qwen2.5-vl-32b-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    offload_folder="offload",
    offload_state_dict=True
)
Production Deployment
Using vLLM (High Throughput):
pip install vllm
python -m vllm.entrypoints.openai.api_server \
    --model E:/huggingface/qwen2.5-vl-32b-instruct \
    --dtype bfloat16 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9
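The vLLM server speaks the OpenAI chat API; a hedged request passing an image by URL (the exact multimodal payload shape can differ between vLLM versions, and the image URL is a placeholder):
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "E:/huggingface/qwen2.5-vl-32b-instruct",
        "messages": [{"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            {"type": "text", "text": "Describe this image."}
        ]}],
        "max_tokens": 256
    }'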
Using Text Generation Inference:
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v E:/huggingface/qwen2.5-vl-32b-instruct:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id /data --dtype bfloat16
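A quick text-only smoke test against TGI's standard /generate endpoint; whether this container version supports Qwen2.5-VL's vision inputs should be verified against the TGI release notes.
curl http://localhost:8080/generate \
    -H "Content-Type: application/json" \
    -d '{"inputs": "Briefly explain what a vision-language model does.", "parameters": {"max_new_tokens": 128}}'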
Fine-tuning
The model supports fine-tuning using standard Hugging Face training workflows:
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir="./qwen-vl-finetuned",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    num_train_epochs=3,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch"
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator
)
trainer.train()
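The train_dataset and data_collator above are placeholders. A minimal collator sketch follows, assuming each example is a dict with a chat-format "messages" list and a PIL "image" and reusing the processor loaded earlier; define it before constructing the Trainer, and note that prompt-token masking is omitted and should follow your training recipe.
def vl_data_collator(examples):
    # Render each conversation to text and batch it with its image through the processor
    texts = [processor.apply_chat_template(ex["messages"], tokenize=False) for ex in examples]
    images = [ex["image"] for ex in examples]
    batch = processor(text=texts, images=images, padding=True, return_tensors="pt")
    # Train on all non-padding tokens; refine the mask for instruction tuning as needed
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100
    batch["labels"] = labels
    return batch

data_collator = vl_data_collator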
Available Variants
- Quantized Models: 25 quantized versions (INT8, INT4, GPTQ, AWQ)
 - Fine-tuned Adapters: 5 LoRA/QLoRA adapter models
 - Specialized Fine-tunes: 47 community fine-tuned variants for specific domains
 
License
This model is released under the Apache License 2.0 (base model license).
Copyright 2025 Alibaba Cloud (base model)
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Terms of Use
- ✅ Commercial use allowed
 - ✅ Modification and distribution permitted
 - ✅ Private and public use
 - ⚠️ Must include original license and copyright notice
 - ⚠️ Provided "as-is" without warranty
 
Important Notice - Abliterated Model
⚠️ This is an uncensored, abliterated variant with removed safety mechanisms:
- User Responsibility: Users are solely responsible for appropriate use and ethical considerations
 - No Built-in Safety: This model does not include content filtering or safety guardrails
 - Intended Use: Research, development, and responsible applications with proper oversight
 - Not Endorsed: This abliterated variant is not officially endorsed or supported by Alibaba Cloud
 - Legal Compliance: Users must ensure compliance with all applicable laws and regulations
 
Citation
If you use Qwen2.5-VL-32B-Instruct in your research or applications, please cite:
@article{qwen2.5-vl,
  title={Qwen2.5-VL Technical Report},
  author={Bai, Shuai and others},
  journal={arXiv preprint arXiv:2502.13923},
  year={2025},
  url={https://arxiv.org/abs/2502.13923}
}
Contact and Resources
Official Resources
- Model Card: Hugging Face - Qwen/Qwen2.5-VL-32B-Instruct
 - GitHub Repository: QwenLM/Qwen2.5-VL
 - Technical Report: arXiv:2502.13923
 - Official Blog: Qwen2.5-VL-32B Announcement
 
Community and Support
- Hugging Face Spaces: 86 community demos and applications
 - Discussions: Hugging Face Community
 - Issues: Report bugs on the GitHub repository
 
Related Models
- Qwen2.5-VL-7B-Instruct: Smaller variant for resource-constrained environments
 - Qwen2.5-VL-72B-Instruct: Larger variant with enhanced capabilities
 - Qwen2-VL-72B-Instruct: Previous generation model
 - Qwen2.5-VL-32B-Instruct: Original censored version with safety guardrails
 
- Base Model: Qwen Team, Alibaba Cloud
- Abliteration: Community modification (uncensored variant)
- Base Release Date: March 25, 2025
- Model Version: 2.5 (abliterated)
- README Version: v1.1