
VQ-Token · LLaVA-OneVision 0.5B (Extreme Token Reduction)

ArXiv · Website · GitHub

VQToken Teaser

VQToken is a neural discrete token representation for video that enables extreme token reduction (~0.07% of dense tokens) while retaining strong downstream performance.
This repository hosts the 0.5B VQToken-enabled LLaVA-OneVision checkpoint.
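
To make the ~0.07% figure concrete, here is a back-of-the-envelope sketch in Python; the frame count and per-frame token count are illustrative assumptions, not values taken from this card:

# Illustrative only: the frame and per-frame token counts below are assumed, not from this card.
frames = 32                                  # hypothetical number of sampled frames
tokens_per_frame = 729                       # hypothetical dense visual tokens per frame
dense_tokens = frames * tokens_per_frame     # 23,328 dense tokens
kept_tokens = round(dense_tokens * 0.0007)   # ~16 tokens at ~0.07%
print(f"{dense_tokens} dense tokens -> ~{kept_tokens} VQ tokens")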


🧠 Model Summary

  • Base backbone: LLaVA-OneVision (0.5B)
  • VQToken module: learns discrete video tokens; supports fixed / adaptive token budgets
  • Goal: reduce the video token count dramatically while preserving downstream video-LLM accuracy
  • Interface: works with lmms-eval (preferred) and with the modified LLaVA-OneVision loader in the project repo

🏗️ How this checkpoint was trained

The VQToken adapter is integrated with the LLaVA-OneVision 0.5B backbone and fine-tuned; see the training script in the project repo for the training data, full hyperparameters, and pipeline details.


🚀 Quick Test (CLI via lmms-eval)

We recommend testing with lmms-eval. The project repo provides a ready-made evaluation script; alternatively, run the equivalent command directly:

# env (adjust as needed)
export HF_HOME="/path/to/your/hf/cache"
export HF_TOKEN="your_hf_token_here"
export HF_HUB_ENABLE_HF_TRANSFER=1
# if any eval calls OpenAI endpoints
# export OPENAI_API_KEY="your_openai_key_here"

# Helpful on some single-GPU setups
export NCCL_P2P_DISABLE="1"
export NCCL_IB_DISABLE="1"

PRETRAIN=haichaozhang/VQ-Token-llava-ov-0.5b

CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes=1 --main_process_port 29509 \
  -m lmms_eval \
  --model llava_onevision_vqtoken \
  --model_args pretrained=$PRETRAIN,conv_template=qwen_1_5,model_name=llava_qwen \
  --tasks activitynetqa --batch_size 1 \
  --log_samples \
  --log_samples_suffix llava_onevision \
  --output_path ./logs_vqtoken/

You can swap --tasks for other video QA benchmarks supported by lmms-eval.


🧪 Minimal Python Inference

import copy, numpy as np, torch
from decord import VideoReader, cpu
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates

pretrained = "haichaozhang/VQ-Token-llava-ov-0.5b"
tok, model, imgproc, _ = load_pretrained_model(
    pretrained, None, "llava_qwen",
    device_map="auto", attn_implementation="sdpa", multimodal=True
)
model.eval()

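# Uniformly sample n frames from the video with decord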
def frames(path, n=16):
    vr = VideoReader(path, ctx=cpu(0))
    idx = np.linspace(0, len(vr)-1, n, dtype=int).tolist()
    return vr.get_batch(idx).asnumpy()  # (T,H,W,C)

video = "sample/demo.mp4"
vid = frames(video, 16)
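# Preprocess the sampled frames into half-precision pixel values on the GPU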
pix = imgproc.preprocess(vid, return_tensors="pt")["pixel_values"].half().cuda()
images = [pix]

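# Build a qwen_1_5 chat prompt that includes the image placeholder token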
conv = copy.deepcopy(conv_templates["qwen_1_5"])
q = f"{DEFAULT_IMAGE_TOKEN}\nDescribe what's happening in this video."
conv.append_message(conv.roles[0], q); conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

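# Tokenize the prompt, mapping the image placeholder to IMAGE_TOKEN_INDEX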
ids = tokenizer_image_token(prompt, tok, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()
sizes = [f.shape[:2] for f in vid]

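# Greedy decoding (do_sample=False) conditioned on the sampled video frames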
with torch.no_grad():
    out = model.generate(
        ids, images=images, image_sizes=sizes,
        do_sample=False, temperature=0, max_new_tokens=512,
        modalities=["video"], vis=True
    )

print(tok.batch_decode(out, skip_special_tokens=True)[0])

📦 Intended Use & Notes

  • Use cases: video question answering and video captioning/understanding in scenarios where the token budget is tight.
  • Strengths: extreme token reduction (~0.07% of dense tokens) with competitive performance; supports both fixed and adaptive token-budget regimes.
  • Out-of-scope / caveats: model may hallucinate or be brittle on out-of-distribution content; always validate on your task.

📊 Evaluation

We evaluate through lmms-eval for consistent, reproducible benchmarking. See repo logs and the paper for details on datasets, metrics, and token budgets (fixed vs. adaptive).



📚 Citation

@inproceedings{zhang2025vqtoken,
  title     = {VQToken: Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models},
  author    = {Haichao Zhang and Yun Fu},
  booktitle = {NeurIPS},
  year      = {2025}
}

🙏 Acknowledgements

Thanks to the LLaVA-OneVision / LLaVA-NeXT and lmms-eval communities for their open tooling and baselines.
