LLaVA-OneVision on Amazon SageMaker
This repository contains code and instructions for deploying the LLaVA-OneVision multimodal model on Amazon SageMaker using the Hugging Face Inference Toolkit.
Original Model: This deployment is based on the llava-hf/llava-onevision-qwen2-0.5b-ov-hf model. For more details about the model architecture and capabilities, see the original model card.
Overview
LLaVA-OneVision is an open-source multimodal LLM that can simultaneously handle single-image, multi-image, and video scenarios. This repository provides everything you need to deploy it as a scalable inference endpoint on Amazon SageMaker.
What's Included
- code/inference.py: SageMaker inference handler with model loading and prediction logic
- code/requirements.txt: Python dependencies for the inference container
- code/test_inference.py: Local testing script to validate the pipeline before deployment
- llava_ov_sm.ipynb: Complete deployment notebook with step-by-step instructions
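For orientation, code/inference.py follows the standard custom-handler pattern supported by the Hugging Face Inference Toolkit: a model_fn that loads the model once at container startup and a predict_fn that runs generation per request. The snippet below is only an illustrative sketch of that structure; the actual script in this repository is the source of truth, and details such as dtype, device placement, and error handling may differ.

```python
# Minimal, illustrative sketch of a SageMaker inference handler for
# LLaVA-OneVision; the real code/inference.py in this repo may differ.
import base64
import io
import os

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration


def model_fn(model_dir):
    """Load the model and processor once when the container starts."""
    # HF_MODEL_ID is set in the deployment below; fall back to model_dir
    model_id = os.environ.get("HF_MODEL_ID", model_dir)
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaOnevisionForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )
    return {"model": model, "processor": processor}


def predict_fn(data, model_and_processor):
    """Decode base64 images, build a chat prompt, and generate a response."""
    model = model_and_processor["model"]
    processor = model_and_processor["processor"]
    prompt = data["inputs"]
    params = data.get("parameters", {})
    images = [
        Image.open(io.BytesIO(base64.b64decode(img))).convert("RGB")
        for img in data.get("images", [])
    ]

    # One image placeholder per image, followed by the text prompt
    content = [{"type": "image"} for _ in images] + [{"type": "text", "text": prompt}]
    conversation = [{"role": "user", "content": content}]
    text = processor.apply_chat_template(conversation, add_generation_prompt=True)

    inputs = processor(text=text, images=images or None, return_tensors="pt").to(
        model.device, torch.float16
    )
    output_ids = model.generate(**inputs, **params)
    generated = processor.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return {"generated_text": generated, "prompt": prompt, "num_images": len(images)}
```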
Prerequisites
Install the Amazon SageMaker Python SDK:
pip install -qU sagemaker
Quick Start: Deploy to SageMaker
1. Setup
import sagemaker
from sagemaker.huggingface import HuggingFaceModel
# Initialize SageMaker session
sess = sagemaker.Session()
role = sagemaker.get_execution_role()
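Note: sagemaker.get_execution_role() only resolves a role when running inside SageMaker (Studio or a notebook instance). If you run the deployment from a local machine, pass the ARN of an IAM role with SageMaker permissions instead; the ARN below is a placeholder:

```python
# Outside SageMaker, supply the execution role ARN explicitly
# (placeholder ARN; substitute a role that can create SageMaker resources).
role = "arn:aws:iam::<account-id>:role/<sagemaker-execution-role>"
```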
2. Create Model
hub = {
    'HF_MODEL_ID': "jgalego/llava-onevision-qwen2-0.5b-ov-hf",
    'HF_TASK': "image-text-to-text"
}

huggingface_model = HuggingFaceModel(
    transformers_version="4.49",
    pytorch_version="2.6",
    py_version="py312",
    env=hub,
    role=role,
    entry_point='inference.py',
    source_dir='./code'
)
3. Deploy Endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type='ml.g4dn.xlarge',
    endpoint_name='llava-onevision-endpoint',
    model_data_download_timeout=5*60,
    container_startup_health_check_timeout=5*60
)
4. Make Predictions
import base64
# Read and encode your image
with open('example.jpg', 'rb') as f:
    image_bytes = base64.b64encode(f.read()).decode('utf-8')

# Prepare request
payload = {
    "inputs": "Describe this image in detail.",
    "images": [image_bytes],
    "parameters": {
        "max_new_tokens": 256,
        "temperature": 0.7,
        "top_p": 0.9,
        "do_sample": True,
        "repetition_penalty": 1.2,
        "no_repeat_ngram_size": 3
    }
}
# Get prediction
response = predictor.predict(payload)
print(response['generated_text'])
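Since LLaVA-OneVision handles multi-image scenarios, the same endpoint accepts several images in one request (see Input Format below). A short example, assuming two local files example1.jpg and example2.jpg:

```python
import base64

def encode_image(path):
    # Read a local file and return its base64-encoded contents
    with open(path, 'rb') as f:
        return base64.b64encode(f.read()).decode('utf-8')

payload = {
    "inputs": "Compare these two images and describe the differences.",
    "images": [encode_image('example1.jpg'), encode_image('example2.jpg')],
    "parameters": {"max_new_tokens": 256, "do_sample": True}
}

response = predictor.predict(payload)
print(response['generated_text'])
```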
5. Cleanup
predictor.delete_model()
predictor.delete_endpoint()
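If the predictor object is no longer available (for example, after a kernel restart), you can reattach to the running endpoint by name before cleaning up; a minimal sketch:

```python
from sagemaker.huggingface import HuggingFacePredictor

# Reattach to the endpoint created in step 3, then tear it down
predictor = HuggingFacePredictor(endpoint_name='llava-onevision-endpoint')
predictor.delete_model()
predictor.delete_endpoint()
```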
Local Testing
Before deploying to SageMaker, test the inference script locally:
# Test with an image
python code/test_inference.py --image code/example.jpg --prompt "Describe this image in detail."
# Test text-only (no image)
python code/test_inference.py --prompt "What is the capital of France?"
# Run complete test suite
python code/test_inference.py --image code/example.jpg --test-suite
Input Format
The endpoint expects JSON with the following structure:
{
  "inputs": "Your text prompt here",
  "images": ["base64_encoded_image_1", "base64_encoded_image_2"],
  "parameters": {
    "max_new_tokens": 256,
    "temperature": 0.7,
    "top_p": 0.9,
    "do_sample": true,
    "repetition_penalty": 1.2,
    "no_repeat_ngram_size": 3
  }
}
Parameters
- inputs (required): Text prompt/question
- images (optional): Array of base64-encoded images
- parameters (optional): Generation parameters
  - max_new_tokens: Maximum tokens to generate (default: 512)
  - temperature: Sampling temperature (default: 0.7)
  - top_p: Nucleus sampling parameter (default: 0.9)
  - do_sample: Enable sampling (default: true)
  - repetition_penalty: Penalty for repeating tokens (default: 1.2)
  - no_repeat_ngram_size: Prevent n-gram repetition (default: 3)
Output Format
The endpoint returns:
{
  "generated_text": "The model's generated response...",
  "prompt": "Original prompt",
  "num_images": 1
}
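The deployed endpoint can also be invoked without the SageMaker Python SDK, for example from an application using boto3. A sketch, reusing the endpoint name from step 3:

```python
import base64
import json

import boto3

runtime = boto3.client('sagemaker-runtime')

with open('example.jpg', 'rb') as f:
    image_b64 = base64.b64encode(f.read()).decode('utf-8')

payload = {
    "inputs": "Describe this image in detail.",
    "images": [image_b64],
    "parameters": {"max_new_tokens": 256}
}

# Send the JSON payload and parse the JSON response
result = runtime.invoke_endpoint(
    EndpointName='llava-onevision-endpoint',
    ContentType='application/json',
    Body=json.dumps(payload)
)
print(json.loads(result['Body'].read())['generated_text'])
```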
Instance Types
Recommended SageMaker instance types:
| Instance Type | vCPUs | GPU Memory | Use Case |
|---|---|---|---|
| ml.g4dn.xlarge | 4 | 16 GB | Development/Testing |
| ml.g4dn.2xlarge | 8 | 16 GB | Light production |
| ml.g5.xlarge | 4 | 24 GB | Production |
| ml.g5.2xlarge | 8 | 24 GB | High throughput |
Troubleshooting
CUDA Out of Memory
Use a larger GPU instance (e.g., ml.g5.2xlarge or higher)
Repetitive Text Generation
Increase repetition_penalty (e.g., 1.3-1.5) or adjust no_repeat_ngram_size
Slow Inference
- Use a more powerful instance type
- Consider model quantization (see the sketch below)
- Reduce max_new_tokens
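For the quantization option, one approach is 4-bit loading with bitsandbytes inside model_fn. This is a sketch under the assumption that bitsandbytes is added to code/requirements.txt; it is not part of the default setup in this repository:

```python
import torch
from transformers import BitsAndBytesConfig, LlavaOnevisionForConditionalGeneration

# 4-bit NF4 quantization (requires the bitsandbytes package on a GPU instance)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    "llava-hf/llava-onevision-qwen2-0.5b-ov-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
```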
Endpoint Timeout
Increase timeout values when deploying:
predictor = huggingface_model.deploy(
...
model_data_download_timeout=10*60,
container_startup_health_check_timeout=10*60
)