# LongCLIP-ViT-B-32 (CLIP ViT-B/32 with 512-token text context)

## What’s different vs standard CLIP

- Based on CLIP ViT-B/32 (from laion/CLIP-ViT-B-32-laion2B-s34B-b79K).
- Longer text context: `max_position_embeddings=512` (vs the usual 77); see the quick check below.
- Training data: 1.67M image-caption pairs sampled from LAION-2B, with captions regenerated by Qwen2.5-VL-72B (up to 512 tokens each).
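
You can confirm the extended context window after loading; a quick check via the text config of the standard `transformers` CLIP classes:

```python
from transformers import CLIPModel

model = CLIPModel.from_pretrained("AlpachinoNLP/LongCLIP-ViT-B-32")
# Standard CLIP text encoders report 77 positions; this checkpoint reports 512.
print(model.config.text_config.max_position_embeddings)
```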

## Usage

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "AlpachinoNLP/LongCLIP-ViT-B-32"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("path/to/image.jpg").convert("RGB")
texts = ["a short caption", "a much longer caption ..."]

# Tokenize captions up to 512 tokens (standard CLIP processors cap at 77).
inputs = processor(
    text=texts,
    images=image,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=512,
)

with torch.no_grad():
    outputs = model(**inputs)

# Image-to-text similarity, normalized over the candidate texts.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs[0].tolist())
```
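
For larger batches it is usually worth moving the model and inputs to a GPU; an optional sketch continuing from the block above (CUDA availability is an assumption):

```python
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# The processor returns a BatchEncoding; move its tensors to the same device.
inputs = {k: v.to(device) for k, v in inputs.items()}
with torch.no_grad():
    outputs = model(**inputs)
```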

## Zero-shot classification example

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "AlpachinoNLP/LongCLIP-ViT-B-32"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

url = "https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

labels = ["cat", "dog", "playing music"]
inputs = processor(text=labels, images=image, return_tensors="pt",
                   padding=True, truncation=True, max_length=512)

with torch.no_grad():
    # logits_per_image has shape (num_images, num_texts); softmax over labels.
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
print({label: float(p) for label, p in zip(labels, probs)})
```
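
CLIP-style zero-shot accuracy often improves with simple prompt templates rather than bare class names; a small variation on the example above (the template wording is illustrative, not something this checkpoint prescribes):

```python
# Wrap each label in a natural-language template before encoding.
texts = [f"a photo of {label}" for label in labels]
inputs = processor(text=texts, images=image, return_tensors="pt",
                   padding=True, truncation=True, max_length=512)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
print({label: float(p) for label, p in zip(labels, probs)})
```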

## Image/text embeddings

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model_id = "AlpachinoNLP/LongCLIP-ViT-B-32"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

# Text embeddings (captions up to 512 tokens).
inputs = processor(text=["a caption"], return_tensors="pt",
                   padding=True, truncation=True, max_length=512)
with torch.no_grad():
    text_features = model.get_text_features(**inputs)
```
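
Image embeddings come from `get_image_features`; to score an image against the captions, L2-normalize both embeddings and take a dot product (cosine similarity). A minimal sketch continuing from the block above; the image path is a placeholder:

```python
from PIL import Image

image = Image.open("path/to/image.jpg").convert("RGB")
image_inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    image_features = model.get_image_features(**image_inputs)

# Cosine similarity: dot product of unit-normalized embeddings.
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
similarity = (image_features @ text_features.T)[0]
print(similarity.tolist())
```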

## Limitations

- This is a research/teaching checkpoint from a local training pipeline; evaluate on your target data before using it downstream.
- Texts longer than 512 tokens are truncated by the processor when `truncation=True`; see the token-count check below.
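
To see whether a given caption will be cut off, count its tokens before encoding; a minimal check using the processor's tokenizer (the caption string is a placeholder):

```python
caption = "a very long caption ..."
n_tokens = len(processor.tokenizer(caption).input_ids)  # includes BOS/EOS
if n_tokens > 512:
    print(f"{n_tokens} tokens: will be truncated to 512")
else:
    print(f"{n_tokens} tokens: fits in the context window")
```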