---
base_model:
- DeepGlint-AI/rice-vit-large-patch14-560
- Qwen/Qwen3-4B-Instruct-2507
datasets:
- lmms-lab/LLaVA-One-Vision-1.5-Mid-Training-85M
- lmms-lab/LLaVA-OneVision-1.5-Insturct-Data
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
---

# LLaVA-OneVision-1.5: Fully Open-Source State-of-the-Art VLM

This is the official Hugging Face model card for **LLaVA-OneVision-1.5**, a family of Large Multimodal Models (LMMs) presented in the paper [LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training](https://huggingface.co/papers/2509.23661).

📚 [Paper](https://huggingface.co/papers/2509.23661) | 💻 [Code](https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5) | 🏠 [Project Page](https://huggingface.co/collections/lmms-lab/llava-onevision-15-68d385fe73b50bd22de23713) | 🚀 [Demo](https://huggingface.co/spaces/lmms-lab/LLaVA-OneVision-1.5)

# ✨ Key Features

**LLaVA-OneVision-1.5** introduces a family of **fully open-source** Large Multimodal Models (LMMs) that achieve **state-of-the-art performance** at substantially **lower cost** through training on **native-resolution** images.

1. **Superior Performance**
   A family of fully open-source large multimodal models demonstrating **superior performance** across multiple multimodal benchmarks, **outperforming Qwen2.5-VL** in most evaluation tasks.

2. **High-Quality Data at Scale**
   Meticulously curated **mid-training and SFT data** with rigorous filtering and quality control.
   - Concept-balanced, highly diverse, high-quality caption data
   - Comprehensive instruction fine-tuning data covering a wide range of tasks

3. **Ultra-Efficient Training Framework**
   Complete end-to-end training framework designed for maximum efficiency:
   - **$16K total budget** for full model training
   - Built on **Megatron-LM** with support for **MoE**, **FP8**, and **long sequence parallelization**
   - Optimized codebase for cost-effective scaling

4. **Fully Open Framework** for community access and reproducibility:
   - ✅ High-quality mid-training & SFT data
   - ✅ Complete training framework & code
   - ✅ Training recipes & configurations
   - ✅ Base & instruct model checkpoints
   - ✅ Comprehensive training logs & metrics

## Models

| Model | HF Link | Training Log |
|---|---|---|
| LLaVA-OV-1.5-4B-Instruct | [🤗 HF / 4B-Instruct](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-4B-Instruct) | [📈 Tensorboard](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-4B-Instruct/tensorboard) |
| LLaVA-OV-1.5-8B-Instruct | [🤗 HF / 8B-Instruct](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-8B-Instruct) | [📈 Tensorboard](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-8B-Instruct/tensorboard) |

## Dataset

| Description | Link |
|---|---|
| Mid-training data for LLaVA-OneVision-1.5 | [🤗 Download (Uploading!)](https://huggingface.co/datasets/lmms-lab/LLaVA-One-Vision-1.5-Mid-Training-85M) |
| SFT data for LLaVA-OneVision-1.5 | [🤗 Download (Uploading!)](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-1.5-Insturct-Data) |

## Evaluation Results

All evaluations were conducted using [lmms_eval](https://github.com/EvolvingLMMs-Lab/lmms-eval).
## Quick Start with HuggingFace

The following snippet shows how to use the chat model with `transformers` and `qwen_vl_utils`:

```python
from transformers import AutoTokenizer, AutoProcessor, AutoModelForCausalLM
from qwen_vl_utils import process_vision_info

model_path = "lmms-lab/LLaVA-One-Vision-1.5-8B-Instruct"

# Default: load the model on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

# Default processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

## Evaluation

```bash
# pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git
accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
    --model=llava_onevision1_5 \
    --model_args=pretrained=lmms-lab/LLaVA-OneVision-1.5-8B-Instruct,attn_implementation=flash_attention_2,max_pixels=3240000 \
    --tasks=mmmu_val,mmmu_pro_standard,mmbench_en_test,mmerealworld,mmerealworld_cn,ai2d,ai2d_no_mask,vstar_bench,chartqa,charxiv,docvqa_test,mathvista_testmini,mmstar,scienceqa \
    --batch_size=1
```

## Quick Start Guide

### 1. 🐳 Docker (Recommended)

We strongly recommend using the Docker environment for a seamless experience. The following instructions are tailored for an A100 80GB GPU environment.

```bash
# Clone repository
git clone https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5.git
cd LLaVA-OneVision-1.5

# Build image
docker build -t llava_megatron:25.04 .

# Run container with -w to set the working directory to the mounted volume
docker run -it --gpus all \
    --ipc host --net host --privileged --cap-add IPC_LOCK \
    --ulimit memlock=-1 --ulimit stack=67108864 --rm \
    -v $(pwd):/workspace/LLaVA-OneVision-1.5 \
    -w /workspace/LLaVA-OneVision-1.5 \
    --name "llava_megatron_container" \
    llava_megatron:25.04 /bin/bash
```

### 2. Checkpoint and Format Conversion

You have two options to get started with LLaVA-OneVision-1.5-stage-0:

#### Option 1: Download the pre-trained model from HuggingFace

Download our `LLaVA-OneVision-1.5-4B-stage0` model directly from [HuggingFace](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-4B-stage0).

#### Option 2: Merge the initial weights yourself

Alternatively, merge the initial weights from the original ViT and LLM:

```bash
python ds/merge_model.py \
    --vit_path DeepGlint-AI/rice-vit-large-patch14-560 \
    --llm_path Qwen/Qwen3-4B-Instruct-2507 \
    --output LLaVA-OneVision-1.5-4B-stage0
```

Note: when merging weights, the adapter component is initialized with default values.
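Whichever option you choose, it can be worth a quick check that the stage-0 checkpoint loads as a plain Hugging Face model before converting it to Megatron format. The sketch below is an optional sanity check, not part of the official pipeline: it assumes the stage-0 repository loads with the same `AutoModelForCausalLM`/`AutoProcessor` classes and remote code as the instruct model in the Quick Start above; the repo id is taken from the Option 1 link.

```python
# Optional sanity check (not part of the official pipeline): make sure the
# stage-0 checkpoint loads as a regular HF model before Megatron conversion.
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoProcessor

# Option 1: fetch the released stage-0 checkpoint from the Hub.
# For Option 2, point `local_dir` at the output of ds/merge_model.py instead.
local_dir = snapshot_download(
    repo_id="lmms-lab/LLaVA-OneVision-1.5-4B-stage0",
    local_dir="LLaVA-OneVision-1.5-4B-stage0",
)

model = AutoModelForCausalLM.from_pretrained(
    local_dir, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(local_dir, trust_remote_code=True)
n_params = sum(p.numel() for p in model.parameters()) / 1e9
print(f"Loaded {type(model).__name__} with ~{n_params:.1f}B parameters")
```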
Convert the model from HuggingFace format to Megatron format:

```bash
AIAK_TRAINING_PATH=/workspace/LLaVA-OneVision-1.5 \
bash examples/llava_ov_1_5/convert/convert_4b_hf_to_mcore.sh \
    LLaVA-OneVision-1.5-4B-stage0 \
    LLaVA-OneVision-1.5-4B-stage0_mcore_tp1_pp1 \
    1 1
```

### 3. Stage 1 Alignment-Training

Download the alignment data from [LLaVA-558K-Webdataset](https://huggingface.co/datasets/lmms-lab/LLaVA-558K-Webdataset).

```bash
# ============================================================
# Required environment variables:
#   AIAK_TRAINING_PATH  Root directory of the AIAK-Training-LLM project
#   DATA_PATH           Directory with WebDataset shards (.tar) for pretraining
#   TOKENIZER_PATH      Hugging Face tokenizer directory
#   CHECKPOINT_PATH     Megatron-formatted checkpoint directory (e.g., mcore TP1/PP1)
#   SAVE_CKPT_PATH      Output directory for saving training checkpoints
# ============================================================
AIAK_TRAINING_PATH=/workspace/LLaVA-OneVision-1.5 \
DATA_PATH=LLaVA-558K-Webdataset \
TOKENIZER_PATH=LLaVA-OneVision-1.5-4B-stage0 \
CHECKPOINT_PATH=LLaVA-OneVision-1.5-4B-stage0_mcore_tp1_pp1 \
bash examples/llava_ov_1_5/quick_start/stage_1_alignment_llava_ov_4b.sh
```

### 4. Stage 1.5 Mid-Training

Download our lightweight packed subset from [LLaVA-OneVision-1.5-Mid-Training-Quick-Start-3M-Webdataset](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-1.5-Mid-Training-Webdataset-Quick-Start-3M).

```bash
# ============================================================
# Convert the Stage 1 checkpoint to release format
bash examples/llava_ov_1_5/convert/convert_4b_mcore_to_release.sh \
    stage_1_alignment_llava_ov_4b/iter_0002500/ \
    stage_1_alignment_llava_ov_4b_release 1 1

# ============================================================
# Launch mid-training
AIAK_TRAINING_PATH=/workspace/LLaVA-OneVision-1.5 \
DATA_PATH=LLaVA-OneVision-1.5-Mid-Training-Quick-Start-3M-Webdataset \
TOKENIZER_PATH=LLaVA-OneVision-1.5-4B-stage0 \
CHECKPOINT_PATH=stage_1_alignment_llava_ov_4b_release \
bash examples/llava_ov_1_5/quick_start/stage_1.5_mid_training_llava_ov_4b.sh
```

### 5. Stage 2 Instruct-Training

Download the LLaVA-NeXT-780K webdataset from [LLaVA-NeXT-780K Dataset](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-780k-webdataset).

```bash
# ============================================================
# Convert the Stage 1.5 checkpoint to release format
bash examples/llava_ov_1_5/convert/convert_4b_mcore_to_release.sh \
    stage_1.5_mid_training_llava_ov_4b/iter_0020000/ \
    stage_1.5_mid_training_llava_ov_4b_release 1 1

# ============================================================
# Launch instruct training
AIAK_TRAINING_PATH=/workspace/LLaVA-OneVision-1.5 \
DATA_PATH=LLaVA-NeXT-780k-Webdataset \
TOKENIZER_PATH=LLaVA-OneVision-1.5-4B-stage0 \
CHECKPOINT_PATH=stage_1.5_mid_training_llava_ov_4b_release \
bash examples/llava_ov_1_5/quick_start/stage_2_instruct_llava_ov_4b.sh
```

### 6. Convert mcore to HuggingFace

```bash
AIAK_TRAINING_PATH=/workspace/LLaVA-OneVision-1.5 \
bash examples/llava_ov_1_5/convert/convert_4b_mcore_to_hf.sh \
    stage_2_instruct_llava_ov_4b/iter_0003500 \
    LLaVA-OneVision-1.5-4B-3M-Mid-Training-780K-Instruct \
    1 1

# Copy non-model files (e.g., tokenizer config) to the new directory
find LLaVA-OneVision-1.5-4B-stage0/ -type f -not -iname '*safetensors*' -exec cp {} LLaVA-OneVision-1.5-4B-3M-Mid-Training-780K-Instruct/ ';'
```
### 7. Evaluation

```bash
# pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git
CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch \
    --num_processes=4 --main_process_port 12399 -m lmms_eval \
    --model=llava_onevision1_5 --batch_size=1 --tasks=mme \
    --model_args=pretrained=/workspace/LLaVA-OneVision-1.5/LLaVA-OneVision-1.5-4B-3M-Mid-Training-780K-Instruct,max_pixels=3240000
```

## Fully Reproducing Guide

> [!TIP]
> More detailed reproduction steps for the complete process will be provided after the dataset upload is completed.

### Mid-Training

To improve training efficiency, we use offline sample packing:

1. Download the [**Mid-Training-85M Dataset**](https://huggingface.co/datasets/lmms-lab/LLaVA-One-Vision-1.5-Mid-Training-85M)
2. Pack the data into WebDataset format; refer to [**Examples offline packing**](examples_offline_packing) and [**Offline Padding-Free Data Packing**](examples/llava_ov_1_5/sample_packing/README.md)

### Instruct

1. Download the [**LLaVA-OneVision-1.5-Insturct-Data**](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-1.5-Insturct-Data)
2. Convert the data into WebDataset format; refer to [**Conversion for Mixed Instruction Data**](docs/sft_data_preprocessing.md). A minimal packing sketch follows below.
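The packing scripts and docs linked above are the reference for the exact schema; the snippet below is only a rough illustration of the target WebDataset shard layout, written against the `webdataset` package. The sample list, the `jpg`/`json` field names, and the shard pattern are placeholders, not the repository's actual format.

```python
# Illustrative sketch only: see the linked packing docs for the real schema.
# Assumes `pip install webdataset` and (image_path, annotation) pairs on disk.
import json
import webdataset as wds

samples = [
    ("images/0001.jpg", {"conversations": [{"from": "human", "value": "Describe this image."}]}),
    # ... more (image_path, annotation) pairs
]

# Write shards of up to 10k samples each: packed-000000.tar, packed-000001.tar, ...
with wds.ShardWriter("packed-%06d.tar", maxcount=10_000) as sink:
    for idx, (image_path, annotation) in enumerate(samples):
        with open(image_path, "rb") as f:
            image_bytes = f.read()
        sink.write({
            "__key__": f"sample{idx:09d}",    # key shared by all fields of one sample
            "jpg": image_bytes,               # raw image bytes
            "json": json.dumps(annotation),   # conversation / annotation record
        })
```

When the shards are read back with a WebDataset loader, entries sharing the same `__key__` are grouped into a single training sample.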
## Roadmaps

Q4 2025 Key Deliverables:

1. **Ultra-efficient MoE Training**
2. **Full Video Input LLM**

## Contributors

Thanks so much to all of our amazing contributors!

fdcp, anxiangsir, yiyexy, wideyard, chengzheng345, killTheHostage, mathCrazyy, yunglechao, RobitYadda