LongCLIP-ViT-B-32 (CLIP ViT-B/32 with 512-token text context)

What’s different vs standard CLIP

  • Based on CLIP ViT-B/32 (initialized from laion/CLIP-ViT-B-32-laion2B-s34B-b79K).
  • Longer text context: max_position_embeddings=512 (vs. the usual 77); see the config check below.
  • Training data: 1.67M image-caption pairs sampled from LAION-2B, with captions regenerated by Qwen2.5-VL-72B (512-token max length).
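
You can confirm the extended context window by inspecting the model config (a minimal sketch; the fields used are the standard transformers CLIPConfig/CLIPTextConfig attributes):

import json
from transformers import AutoConfig

config = AutoConfig.from_pretrained("AlpachinoNLP/LongCLIP-ViT-B-32")
# The text tower's position-embedding table sets the maximum caption length.
print(config.text_config.max_position_embeddings)  # expected: 512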

Usage

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "AlpachinoNLP/LongCLIP-ViT-B-32" 
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("path/to/image.jpg").convert("RGB")
texts = ["a short caption", "a much longer caption ..."]

inputs = processor(
    text=texts,
    images=image,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=512,
)

with torch.no_grad():
    outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=-1)  # shape: [num_images, num_texts]
print(probs[0].tolist())

Zero-shot classification example

import torch
from PIL import Image
import requests
from transformers import CLIPModel, CLIPProcessor

model_id = "AlpachinoNLP/LongCLIP-ViT-B-32"  # or "<your-hf-org>/<your-model-repo>"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

url = "https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
labels = ["cat", "dog", "playing music"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True, truncation=True, max_length=512)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
print({label: float(p) for label, p in zip(labels, probs)})
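
Zero-shot results often improve with natural-language prompt templates instead of bare labels (a common CLIP practice, not specific to this checkpoint). A minimal variation of the example above:

prompts = [f"a photo of {label}" for label in labels]
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True, truncation=True, max_length=512)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
# Report probabilities under the original label names.
print({label: float(p) for label, p in zip(labels, probs)})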

Image/text embeddings

import torch
from transformers import CLIPModel, CLIPProcessor

model_id = "AlpachinoNLP/LongCLIP-ViT-B-32"  # or "<your-hf-org>/<your-model-repo>"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

inputs = processor(text=["a caption"], return_tensors="pt", padding=True, truncation=True, max_length=512)
with torch.no_grad():
    text_features = model.get_text_features(**inputs)  # [B, 512]
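
Image embeddings follow the same pattern via get_image_features (a minimal sketch continuing from the code above; the image path is a placeholder):

from PIL import Image

image = Image.open("path/to/image.jpg").convert("RGB")
image_inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    image_features = model.get_image_features(**image_inputs)  # [B, 512]

# L2-normalize both sides before computing cosine similarity across modalities.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = image_features @ text_features.T  # [num_images, num_texts]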

Limitations

  • This is a research/teaching checkpoint from a local training pipeline; evaluate it on your target data before using it downstream.
  • Texts longer than 512 tokens are truncated by the processor when truncation=True; you can check token counts up front (see the sketch below).
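
A minimal way to check whether a caption fits in the 512-token window, using the processor's underlying tokenizer (assumes model and processor are loaded as above):

caption = "a very long caption ..."
# input_ids includes the begin/end-of-text tokens, which count toward the limit.
n_tokens = len(processor.tokenizer(caption)["input_ids"])
print(n_tokens, "tokens; fits:", n_tokens <= 512)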