## Data preparation
### Data for training
- The image pretraining dataset is from [LLaVA](https://github.com/haotian-liu/LLaVA).
- The image tuning dataset is from [LLaVA](https://github.com/haotian-liu/LLaVA).
- The video pretraining dataset is from [Valley](https://github.com/RupertLuo/Valley).
- The video tuning dataset is from [Video-ChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT).
- Download the training annotations from [Baidu Disk](https://pan.baidu.com/s/1BipI3_f--GRWqaWTGYp-Jg?pwd=wkl0), [Google Disk](https://drive.google.com/file/d/11-1NBXNeiNQE2wPbue1dFph_Na_EHRYG/view?usp=drive_link), or [Peking University Disk](https://disk.pku.edu.cn:443/link/84783AB54553DFA150C1C5E82C16EB29).

We also provide the processed data as follows.
<div align="center">
<table border="1" width="100%">
<tr align="center">
<th>Datasets</th><th>Baidu Disk</th>
</tr>
<tr align="center">
<td>Image pretraining</td><td><a href="">Link</a></td>
</tr>
<tr align="center">
<td>Image tuning</td><td><a href="">Link</a></td>
</tr>
<tr align="center">
<td>Video pretraining</td><td><a href="">Link</a></td>
</tr>
<tr align="center">
<td>Video tuning</td><td><a href="">Link</a></td>
</tr>
</table>
</div>
After downloading all of them, organize the data as follows in `DATA_ROOT`.
```Shell
DATA_ROOT
├── llava_image
├── llava_image_tune
├── valley
└── videochatgpt_tune
```
### Data for validating
- For images, follow LLaVA's instructions. **You MUST first download [eval.zip](https://drive.google.com/file/d/1atZSBBrAX54yYpxtVVW33zFvcnaHeFPy/view?usp=sharing).** It contains custom annotations, scripts, and the prediction files with LLaVA v1.5. Extract it to `eval`. This also provides a general structure for all datasets.
- For videos, the videos and annotations can be downloaded from [Video-ChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT). We also provide the processed data as follows.
<div align="center">
<table border="1" width="100%">
<tr align="center">
<th>Datasets</th><th>Baidu Disk</th><th>Google Disk</th><th>Peking University Disk</th>
</tr>
<tr align="center">
<td>Activitynet_Zero_Shot_QA</td><td><a href="https://pan.baidu.com/s/1d_AVx9Mz_57nA3exhQZGyA?pwd=9amr">Link</a></td><td>-</td><td>-</td>
</tr>
<tr align="center">
<td>MSRVTT_Zero_Shot_QA</td><td><a href="https://pan.baidu.com/s/1QHUtwHXm4Vc-Wc12XFCFsA?pwd=1rj8">Link</a></td><td><a href="https://drive.google.com/file/d/1yXh9lz7flQ5Ui2IRSd6Qi6RqSEeUJwl3/view?usp=drive_link">Link</a></td><td>-</td>
</tr>
<tr align="center">
<td>MSVD_Zero_Shot_QA</td><td><a href="https://pan.baidu.com/s/1PJSHkjHG2BPl_ddUnBj9AA?pwd=jj34">Link</a></td><td><a href="https://drive.google.com/file/d/1_q4eiSdb7i8P3Hmh4lCfgY1uBGyzU_7X/view?usp=drive_link">Link</a></td><td><a href="https://disk.pku.edu.cn:443/link/8B0D01747D8AA65534820B7E60CBFEFC">Link</a></td>
</tr>
<tr align="center">
<td>TGIF_Zero_Shot_QA</td><td><a href="https://pan.baidu.com/s/11ubtWbTtubyBmN9UPvAyow?pwd=98yr">Link</a></td><td><a href="https://drive.google.com/file/d/1so6L9rg_gdC8Segur7rKML-ffd4Ix_I6/view?usp=drive_link">Link</a></td><td><a href="https://disk.pku.edu.cn:443/link/B9AB387EFE8817158F181FF3D7A97163">Link</a></td>
</tr>
</table>
</div>
After downloading all of them, organize the data as follows in `eval`.
```Shell
eval
├── GPT_Zero_Shot_QA
│   ├── Activitynet_Zero_Shot_QA
│   ├── MSRVTT_Zero_Shot_QA
│   ├── MSVD_Zero_Shot_QA
│   └── TGIF_Zero_Shot_QA
├── gqa
│   ├── answers
│   ├── data
│   └── llava_gqa_testdev_balanced.jsonl
├── llava-bench-in-the-wild
│   ├── answers
│   ├── answers_gpt4.jsonl
│   ├── bard_0718.jsonl
│   ├── bing_chat_0629.jsonl
│   ├── context.jsonl
│   ├── images
│   ├── questions.jsonl
│   ├── README.md
│   └── reviews
├── mmbench
│   ├── answers
│   ├── answers_upload
│   ├── mmbench_dev_20230712.tsv
│   └── mmbench_dev_en_20231003.tsv
├── MME
│   ├── answers
│   ├── convert_answer_to_mme.py
│   └── llava_mme.jsonl
├── mm-vet
│   ├── answers
│   ├── bard_set.json
│   ├── convert_answers.py
│   ├── images
│   ├── llava-mm-vet.jsonl
│   ├── mm-vet.json
│   └── results
├── pope
│   ├── answers
│   ├── coco
│   ├── llava_pope_test.jsonl
│   └── val2014
├── scienceqa
│   ├── answers
│   ├── images
│   ├── llava_test_CQM-A.json
│   ├── pid_splits.json
│   └── problems.json
├── seed_bench
│   ├── answers
│   ├── answers_upload
│   ├── extract_video_frames.py
│   └── llava-seed-bench.jsonl
├── textvqa
│   ├── answers
│   ├── llava_textvqa_val_v051_ocr.jsonl
│   ├── TextVQA_0.5.1_val.json
│   └── train_images
├── vizwiz
│   ├── answers
│   ├── answers_upload
│   ├── llava_test.jsonl
│   ├── test
│   ├── test.json
│   ├── train.json
│   └── val.json
└── vqav2
    ├── answers
    ├── answers_upload
    ├── llava_vqav2_mscoco_test2015.jsonl
    ├── llava_vqav2_mscoco_test-dev2015.jsonl
    └── test2015
```
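A quick way to spot gaps after downloading is to check the top-level folders against the layout above. A shell sketch (the function name is illustrative, and it assumes you run it from the directory containing `eval`):

```shell
# Hypothetical sanity check: report benchmark folders from the layout
# above that are still missing under eval/.
check_eval_layout() {
  for d in GPT_Zero_Shot_QA gqa llava-bench-in-the-wild mmbench MME mm-vet \
           pope scienceqa seed_bench textvqa vizwiz vqav2; do
    [ -d "eval/$d" ] || echo "missing: eval/$d"
  done
}
check_eval_layout
```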
## Training
Specify your `DATA_ROOT` according to the data preparation.
- Stage 1 pretraining script: [pretrain.sh](scripts/v1_5/pretrain.sh).
- Stage 2 tuning script: [finetune.sh](scripts/v1_5/finetune.sh).
## Validating
Our image validation code comes from LLaVA and our video validation code comes from Video-ChatGPT; thanks to both teams for their contributions!
You can refer to the official repositories for validation, but we also provide [off-the-shelf](scripts/v1_5/eval) scripts.
### MSRVTT-QA
1. Inference to get the result.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/run_qa_msrvtt.sh
```
2. GPT-Assistant evaluation.
```Shell
bash scripts/v1_5/eval/eval_qa_msrvtt.sh
```
### MSVD-QA
1. Inference to get the result.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/run_qa_msvd.sh
```
2. GPT-Assistant evaluation.
```Shell
bash scripts/v1_5/eval/eval_qa_msvd.sh
```
### TGIF-QA
1. Inference to get the result.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/run_qa_tgif.sh
```
2. GPT-Assistant evaluation.
```Shell
bash scripts/v1_5/eval/eval_qa_tgif.sh
```
### ActivityNet-QA
1. Inference to get the result.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/run_qa_activitynet.sh
```
2. GPT-Assistant evaluation.
```Shell
bash scripts/v1_5/eval/eval_qa_activitynet.sh
```
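The four video benchmarks above share the same two-step recipe, so they can be driven from a single loop. A dry-run sketch that only prints the commands (script names taken from the sections above; remove the `echo` to actually execute them):

```shell
# Print the inference and evaluation commands for each video-QA benchmark
# without running them.
gen_video_eval_cmds() {
  for bench in msrvtt msvd tgif activitynet; do
    echo "CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/run_qa_${bench}.sh"
    echo "bash scripts/v1_5/eval/eval_qa_${bench}.sh"
  done
}
gen_video_eval_cmds
```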
### VQAv2
1. Download [`test2015`](http://images.cocodataset.org/zips/test2015.zip) and put it under `eval/vqav2`.
2. Multi-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/eval_image_vqav2.sh
```
3. Submit the results to the [evaluation server](https://eval.ai/web/challenges/challenge-page/830/my-submission): `eval/vqav2/answers_upload`.
### GQA
1. Download the data following the official instructions [here](https://cs.stanford.edu/people/dorarad/gqa/download.html) and put it under `eval/gqa/data`.
2. Multi-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/eval_image_gqa.sh
```
### VizWiz
1. Download [`test.json`](https://vizwiz.cs.colorado.edu/VizWiz_final/vqa_data/Annotations.zip) and extract [`test.zip`](https://vizwiz.cs.colorado.edu/VizWiz_final/images/test.zip) to `test`. Put them under `eval/vizwiz`.
2. Single-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_vizwiz.sh
```
3. Submit the results to the [evaluation server](https://eval.ai/web/challenges/challenge-page/1911/my-submission): `eval/vizwiz/answers_upload`.
### ScienceQA
1. Under `eval/scienceqa`, download `images`, `pid_splits.json`, and `problems.json` from the `data/scienceqa` folder of the ScienceQA [repo](https://github.com/lupantech/ScienceQA).
2. Single-GPU inference and evaluation.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_sqa.sh
```
### TextVQA
1. Download [`TextVQA_0.5.1_val.json`](https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json) and [images](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip) and extract them to `eval/textvqa`.
2. Single-GPU inference and evaluation.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_textvqa.sh
```
### POPE
1. Download `coco` from [POPE](https://github.com/AoiDragon/POPE/tree/e3e39262c85a6a83f26cf5094022a782cb0df58d/output/coco) and put it under `eval/pope`.
2. Single-GPU inference and evaluation.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_pope.sh
```
### MMBench
1. Download [`mmbench_dev_20230712.tsv`](https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_dev_20230712.tsv) and put it under `eval/mmbench`.
2. Single-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_mmbench.sh
```
3. Submit the results to the [evaluation server](https://opencompass.org.cn/leaderboard-multimodal): `eval/mmbench/answers_upload/mmbench_dev_20230712`.
### LLaVA-Bench-in-the-Wild
1. Extract the contents of [`llava-bench-in-the-wild`](https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild) to `eval/llava-bench-in-the-wild`.
2. Single-GPU inference and evaluation.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_llavabench.sh
```
### MM-Vet
1. Extract [`mm-vet.zip`](https://github.com/yuweihao/MM-Vet/releases/download/v1/mm-vet.zip) to `eval/mm-vet`.
2. Single-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_mmvet.sh
```