LLaVA-OneVision on Amazon SageMaker
This repository contains code and instructions for deploying the LLaVA-OneVision multimodal model on Amazon SageMaker using the Hugging Face Inference Toolkit.
Original Model: This deployment is based on the llava-hf/llava-onevision-qwen2-0.5b-ov-hf model. For more details about the model architecture and capabilities, see the original model card.
Overview
LLaVA-OneVision is an open-source multimodal LLM that can simultaneously handle single-image, multi-image, and video scenarios. This repository provides everything you need to deploy it as a scalable inference endpoint on Amazon SageMaker.
What's Included
- code/inference.py: SageMaker inference handler with model loading and prediction logic
- code/requirements.txt: Python dependencies for the inference container
- code/test_inference.py: Local testing script to validate the pipeline before deployment
- llava_ov_sm.ipynb: Complete deployment notebook with step-by-step instructions
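For orientation, code/inference.py follows the standard custom-handler pattern supported by the Hugging Face Inference Toolkit: a model_fn that loads the model once at container startup and a predict_fn that runs generation per request. The snippet below is only an illustrative sketch of that structure; the actual script in this repository is the source of truth, and details such as dtype, device placement, and error handling may differ.

```python
# Minimal, illustrative sketch of a SageMaker inference handler for
# LLaVA-OneVision; the real code/inference.py in this repo may differ.
import base64
import io
import os

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration


def model_fn(model_dir):
    """Load the model and processor once when the container starts."""
    # HF_MODEL_ID is set in the deployment below; fall back to model_dir
    model_id = os.environ.get("HF_MODEL_ID", model_dir)
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaOnevisionForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )
    return {"model": model, "processor": processor}


def predict_fn(data, model_and_processor):
    """Decode base64 images, build a chat prompt, and generate a response."""
    model = model_and_processor["model"]
    processor = model_and_processor["processor"]
    prompt = data["inputs"]
    params = data.get("parameters", {})
    images = [
        Image.open(io.BytesIO(base64.b64decode(img))).convert("RGB")
        for img in data.get("images", [])
    ]

    # One image placeholder per image, followed by the text prompt
    content = [{"type": "image"} for _ in images] + [{"type": "text", "text": prompt}]
    conversation = [{"role": "user", "content": content}]
    text = processor.apply_chat_template(conversation, add_generation_prompt=True)

    inputs = processor(text=text, images=images or None, return_tensors="pt").to(
        model.device, torch.float16
    )
    output_ids = model.generate(**inputs, **params)
    generated = processor.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return {"generated_text": generated, "prompt": prompt, "num_images": len(images)}
```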
Prerequisites
Install the Amazon SageMaker Python SDK:
pip install -qU sagemaker
Quick Start: Deploy to SageMaker
1. Setup
import sagemaker
from sagemaker.huggingface import HuggingFaceModel
# Initialize SageMaker session
sess = sagemaker.Session()
role = sagemaker.get_execution_role()
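Note: sagemaker.get_execution_role() only resolves a role when running inside SageMaker (Studio or a notebook instance). If you run the deployment from a local machine, pass the ARN of an IAM role with SageMaker permissions instead; the ARN below is a placeholder:

```python
# Outside SageMaker, supply the execution role ARN explicitly
# (placeholder ARN; substitute a role that can create SageMaker resources).
role = "arn:aws:iam::<account-id>:role/<sagemaker-execution-role>"
```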
2. Create Model
hub = {
    'HF_MODEL_ID': "jgalego/llava-onevision-qwen2-0.5b-ov-hf",
    'HF_TASK': "image-text-to-text"
}

huggingface_model = HuggingFaceModel(
    transformers_version="4.49",
    pytorch_version="2.6",
    py_version="py312",
    env=hub,
    role=role,
    entry_point='inference.py',
    source_dir='./code'
)
3. Deploy Endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type='ml.g4dn.xlarge',
    endpoint_name='llava-onevision-endpoint',
    model_data_download_timeout=5*60,
    container_startup_health_check_timeout=5*60
)
4. Make Predictions
import base64
# Read and encode your image
with open('example.jpg', 'rb') as f:
    image_bytes = base64.b64encode(f.read()).decode('utf-8')

# Prepare request
payload = {
    "inputs": "Describe this image in detail.",
    "images": [image_bytes],
    "parameters": {
        "max_new_tokens": 256,
        "temperature": 0.7,
        "top_p": 0.9,
        "do_sample": True,
        "repetition_penalty": 1.2,
        "no_repeat_ngram_size": 3
    }
}
# Get prediction
response = predictor.predict(payload)
print(response['generated_text'])
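Since LLaVA-OneVision handles multi-image scenarios, the same endpoint accepts several images in one request (see Input Format below). A short example, assuming two local files example1.jpg and example2.jpg:

```python
import base64

def encode_image(path):
    # Read a local file and return its base64-encoded contents
    with open(path, 'rb') as f:
        return base64.b64encode(f.read()).decode('utf-8')

payload = {
    "inputs": "Compare these two images and describe the differences.",
    "images": [encode_image('example1.jpg'), encode_image('example2.jpg')],
    "parameters": {"max_new_tokens": 256, "do_sample": True}
}

response = predictor.predict(payload)
print(response['generated_text'])
```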
5. Cleanup
predictor.delete_model()
predictor.delete_endpoint()
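If the predictor object is no longer available (for example, after a kernel restart), you can reattach to the running endpoint by name before cleaning up; a minimal sketch:

```python
from sagemaker.huggingface import HuggingFacePredictor

# Reattach to the endpoint created in step 3, then tear it down
predictor = HuggingFacePredictor(endpoint_name='llava-onevision-endpoint')
predictor.delete_model()
predictor.delete_endpoint()
```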
Local Testing
Before deploying to SageMaker, test the inference script locally:
# Test with an image
python code/test_inference.py --image code/example.jpg --prompt "Describe this image in detail."
# Test text-only (no image)
python code/test_inference.py --prompt "What is the capital of France?"
# Run complete test suite
python code/test_inference.py --image code/example.jpg --test-suite
Input Format
The endpoint expects JSON with the following structure:
{
  "inputs": "Your text prompt here",
  "images": ["base64_encoded_image_1", "base64_encoded_image_2"],
  "parameters": {
    "max_new_tokens": 256,
    "temperature": 0.7,
    "top_p": 0.9,
    "do_sample": true,
    "repetition_penalty": 1.2,
    "no_repeat_ngram_size": 3
  }
}
Parameters
- inputs (required): Text prompt/question
- images (optional): Array of base64-encoded images
- parameters (optional): Generation parameters
  - max_new_tokens: Maximum tokens to generate (default: 512)
  - temperature: Sampling temperature (default: 0.7)
  - top_p: Nucleus sampling parameter (default: 0.9)
  - do_sample: Enable sampling (default: true)
  - repetition_penalty: Penalty for repeating tokens (default: 1.2)
  - no_repeat_ngram_size: Prevent n-gram repetition (default: 3)
Output Format
The endpoint returns:
{
  "generated_text": "The model's generated response...",
  "prompt": "Original prompt",
  "num_images": 1
}
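The deployed endpoint can also be invoked without the SageMaker Python SDK, for example from an application using boto3. A sketch, reusing the endpoint name from step 3:

```python
import base64
import json

import boto3

runtime = boto3.client('sagemaker-runtime')

with open('example.jpg', 'rb') as f:
    image_b64 = base64.b64encode(f.read()).decode('utf-8')

payload = {
    "inputs": "Describe this image in detail.",
    "images": [image_b64],
    "parameters": {"max_new_tokens": 256}
}

# Send the JSON payload and parse the JSON response
result = runtime.invoke_endpoint(
    EndpointName='llava-onevision-endpoint',
    ContentType='application/json',
    Body=json.dumps(payload)
)
print(json.loads(result['Body'].read())['generated_text'])
```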
Instance Types
Recommended SageMaker instance types:
| Instance Type | vCPUs | GPU Memory | Use Case |
|---|---|---|---|
| ml.g4dn.xlarge | 4 | 16 GB | Development/Testing |
| ml.g4dn.2xlarge | 8 | 16 GB | Light production |
| ml.g5.xlarge | 4 | 24 GB | Production |
| ml.g5.2xlarge | 8 | 24 GB | High throughput |
Troubleshooting
CUDA Out of Memory
Use a larger GPU instance (e.g., ml.g5.2xlarge or higher)
Repetitive Text Generation
Increase repetition_penalty (e.g., 1.3-1.5) or adjust no_repeat_ngram_size
Slow Inference
- Use a more powerful instance type
- Consider model quantization (see the sketch below)
- Reduce max_new_tokens
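For the quantization option, one approach is 4-bit loading with bitsandbytes inside model_fn. This is a sketch under the assumption that bitsandbytes is added to code/requirements.txt; it is not part of the default setup in this repository:

```python
import torch
from transformers import BitsAndBytesConfig, LlavaOnevisionForConditionalGeneration

# 4-bit NF4 quantization (requires the bitsandbytes package on a GPU instance)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    "llava-hf/llava-onevision-qwen2-0.5b-ov-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
```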
Endpoint Timeout
Increase timeout values when deploying:
predictor = huggingface_model.deploy(
...
model_data_download_timeout=10*60,
container_startup_health_check_timeout=10*60
)