LLaVA-OneVision on Amazon SageMaker

This repository contains code and instructions for deploying the LLaVA-OneVision multimodal model on Amazon SageMaker using the Hugging Face Inference Toolkit.

Original Model: This deployment is based on the llava-hf/llava-onevision-qwen2-0.5b-ov-hf model. For more details about the model architecture and capabilities, see the original model card.

Overview

LLaVA-OneVision is an open-source multimodal LLM that can simultaneously handle single-image, multi-image, and video scenarios. This repository provides everything you need to deploy it as a scalable inference endpoint on Amazon SageMaker.

What's Included

  • code/inference.py: SageMaker inference handler with model loading and prediction logic (a structural sketch follows this list)
  • code/requirements.txt: Python dependencies for the inference container
  • code/test_inference.py: Local testing script to validate the pipeline before deployment
  • llava_ov_sm.ipynb: Complete deployment notebook with step-by-step instructions
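
For orientation, the sketch below shows the handler structure that the Hugging Face Inference Toolkit expects: model_fn loads the model once at container startup, and predict_fn serves each request. This is a simplified illustration assuming the standard transformers classes for this checkpoint; the shipped code/inference.py is the source of truth.

import base64
import io

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

MODEL_ID = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"

def model_fn(model_dir):
    """Load the model and processor once when the container starts."""
    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = LlavaOnevisionForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype=torch.float16
    ).to("cuda" if torch.cuda.is_available() else "cpu")
    return {"model": model, "processor": processor}

def predict_fn(data, artifacts):
    """Handle one deserialized JSON request (see Input Format below)."""
    model, processor = artifacts["model"], artifacts["processor"]
    prompt = data["inputs"]
    params = data.get("parameters", {})
    images = [
        Image.open(io.BytesIO(base64.b64decode(img))).convert("RGB")
        for img in data.get("images", [])
    ]

    # Build a chat-formatted prompt with one image placeholder per image.
    content = [{"type": "image"}] * len(images) + [{"type": "text", "text": prompt}]
    conversation = [{"role": "user", "content": content}]
    text = processor.apply_chat_template(conversation, add_generation_prompt=True)

    inputs = processor(
        images=images or None, text=text, return_tensors="pt"
    ).to(model.device, torch.float16)
    output = model.generate(**inputs, **params)

    # Strip the prompt tokens and decode only the newly generated text.
    generated = processor.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return {"generated_text": generated, "prompt": prompt, "num_images": len(images)}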

Prerequisites

Install the Amazon SageMaker Python SDK:

pip install -qU sagemaker

Quick Start: Deploy to SageMaker

1. Setup

import sagemaker
from sagemaker.huggingface import HuggingFaceModel

# Initialize SageMaker session
sess = sagemaker.Session()
role = sagemaker.get_execution_role()
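
Note that get_execution_role() raises a ValueError outside SageMaker (e.g., on a local machine). In that case, fall back to an explicit IAM role ARN; the ARN below is a placeholder for your own SageMaker execution role:

import sagemaker

try:
    role = sagemaker.get_execution_role()
except ValueError:
    # Outside SageMaker, supply the ARN of an existing execution role directly.
    # Placeholder ARN: substitute your account ID and role name.
    role = "arn:aws:iam::<account-id>:role/<sagemaker-execution-role>"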

2. Create Model

hub = {
    'HF_MODEL_ID': "jgalego/llava-onevision-qwen2-0.5b-ov-hf",
    'HF_TASK': "image-text-to-text"
}

huggingface_model = HuggingFaceModel(
    transformers_version="4.49",
    pytorch_version="2.6",
    py_version="py312",
    env=hub,
    role=role,
    entry_point='inference.py',
    source_dir='./code'
)

3. Deploy Endpoint

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type='ml.g4dn.xlarge',
    endpoint_name='llava-onevision-endpoint',
    model_data_download_timeout=5*60,
    container_startup_health_check_timeout=5*60
)
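
If the notebook kernel restarts and the predictor object is lost, you can reattach to the running endpoint by name instead of redeploying:

from sagemaker.huggingface import HuggingFacePredictor

# Reattach to an already-running endpoint; uses JSON (de)serialization by default.
predictor = HuggingFacePredictor(endpoint_name="llava-onevision-endpoint")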

4. Make Predictions

import base64

# Read and encode your image
with open('example.jpg', 'rb') as f:
    image_bytes = base64.b64encode(f.read()).decode('utf-8')

# Prepare request
payload = {
    "inputs": "Describe this image in detail.",
    "images": [image_bytes],
    "parameters": {
        "max_new_tokens": 256,
        "temperature": 0.7,
        "top_p": 0.9,
        "do_sample": True,
        "repetition_penalty": 1.2,
        "no_repeat_ngram_size": 3
    }
}

# Get prediction
response = predictor.predict(payload)
print(response['generated_text'])
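
Clients that don't use the SageMaker Python SDK can call the same endpoint through the low-level sagemaker-runtime API. A minimal sketch with boto3:

import json

import boto3

runtime = boto3.client("sagemaker-runtime")

# A text-only request; add base64 "images" exactly as in the SDK example above.
payload = {"inputs": "What is the capital of France?"}

response = runtime.invoke_endpoint(
    EndpointName="llava-onevision-endpoint",
    ContentType="application/json",
    Body=json.dumps(payload),
)
result = json.loads(response["Body"].read())
print(result["generated_text"])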

5. Cleanup

predictor.delete_model()
predictor.delete_endpoint()

Local Testing

Before deploying to SageMaker, test the inference script locally:

# Test with an image
python code/test_inference.py --image code/example.jpg --prompt "Describe this image in detail."

# Test text-only (no image)
python code/test_inference.py --prompt "What is the capital of France?"

# Run complete test suite
python code/test_inference.py --image code/example.jpg --test-suite

Input Format

The endpoint expects JSON with the following structure:

{
  "inputs": "Your text prompt here",
  "images": ["base64_encoded_image_1", "base64_encoded_image_2"],
  "parameters": {
    "max_new_tokens": 256,
    "temperature": 0.7,
    "top_p": 0.9,
    "do_sample": true,
    "repetition_penalty": 1.2,
    "no_repeat_ngram_size": 3
  }
}

Parameters

  • inputs (required): Text prompt/question
  • images (optional): Array of base64-encoded images (see the multi-image example after this list)
  • parameters (optional): Generation parameters
    • max_new_tokens: Maximum tokens to generate (default: 512)
    • temperature: Sampling temperature (default: 0.7)
    • top_p: Nucleus sampling parameter (default: 0.9)
    • do_sample: Enable sampling (default: true)
    • repetition_penalty: Penalty for repeating tokens (default: 1.2)
    • no_repeat_ngram_size: Prevent n-gram repetition (default: 3)
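
The images field accepts multiple entries, matching the multi-image support noted in the Overview: base64-encode each image and append it to the array. A sketch of building such a payload (file names are illustrative):

import base64

def encode_image(path: str) -> str:
    """Return the base64-encoded contents of an image file."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# Illustrative file names; replace with your own images.
payload = {
    "inputs": "What changed between these two images?",
    "images": [encode_image("before.jpg"), encode_image("after.jpg")],
    "parameters": {"max_new_tokens": 256},
}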

Output Format

The endpoint returns:

{
  "generated_text": "The model's generated response...",
  "prompt": "Original prompt",
  "num_images": 1
}

Instance Types

Recommended SageMaker instance types:

Instance Type      vCPUs   GPU Memory   Use Case
ml.g4dn.xlarge     4       16 GB        Development/Testing
ml.g4dn.2xlarge    8       16 GB        Light production
ml.g5.xlarge       4       24 GB        Production
ml.g5.2xlarge      8       24 GB        High throughput

Troubleshooting

CUDA Out of Memory

Use a larger GPU instance (e.g., ml.g5.2xlarge or higher)

Repetitive Text Generation

Increase repetition_penalty (e.g., 1.3-1.5) or adjust no_repeat_ngram_size

Slow Inference

  • Use a more powerful instance type
  • Consider model quantization (see the sketch after this list)
  • Reduce max_new_tokens
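
For quantization, one possible approach (not something this repository ships) is to load the checkpoint in 4-bit with bitsandbytes inside inference.py. This assumes bitsandbytes is added to code/requirements.txt:

import torch
from transformers import BitsAndBytesConfig, LlavaOnevisionForConditionalGeneration

# Hypothetical 4-bit loading path; requires `bitsandbytes` in
# code/requirements.txt and adapting model_fn in code/inference.py.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    "llava-hf/llava-onevision-qwen2-0.5b-ov-hf",
    quantization_config=quant_config,
    device_map="auto",
)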

Endpoint Timeout

Increase timeout values when deploying:

predictor = huggingface_model.deploy(
    ...
    model_data_download_timeout=10*60,
    container_startup_health_check_timeout=10*60
)

Resources

  • Original model card: llava-hf/llava-onevision-qwen2-0.5b-ov-hf
  • Amazon SageMaker Python SDK documentation
  • Hugging Face Inference Toolkit for SageMaker
