---
license: apache-2.0
pipeline_tag: image-segmentation
library_name: pytorch
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- DINOv2
- CLIP
- open-vocabulary segmentation
---
<div align="center">
<h1>
Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation (ICCV 2025)
</h1>

<h3>
<a href="https://www.linkedin.com/in/luca-barsellotti/">Luca Barsellotti*</a>&ensp;
<a href="https://www.linkedin.com/in/lorenzo-bianchi-893bb225a/">Lorenzo Bianchi*</a>&ensp;
<a href="https://www.linkedin.com/in/nicola-messina-a33848164/">Nicola Messina</a>&ensp;
<a href="https://www.linkedin.com/in/fabio-carrara-b28a2b111/">Fabio Carrara</a>&ensp;
<a href="https://aimagelab.ing.unimore.it/imagelab/person.asp?idpersona=90">Marcella Cornia</a>&ensp;
<a href="https://www.lorenzobaraldi.com/">Lorenzo Baraldi</a>&ensp;
<a href="https://fabriziofalchi.it">Fabrizio Falchi</a>&ensp;
<a href="https://www.linkedin.com/in/rita-cucchiara-a4653a13/">Rita Cucchiara</a>
</h3>

[Project Page](https://lorebianchi98.github.io/Talk2DINO/) | [Paper](http://arxiv.org/abs/2411.19331) | [Code](https://github.com/lorebianchi98/Talk2DINO)

</div>

<div align="center">
<figure>
<img alt="Overview of Talk2DINO" src="./assets/overview.png" width="40%">
</figure>
</div>

## About
Open-Vocabulary Segmentation (OVS) aims to segment images according to free-form textual concepts without predefined training classes. While existing vision-language models such as CLIP can generate segmentation masks by leveraging coarse spatial information from Vision Transformers, they struggle with spatial localization because their image and text features are aligned only globally. Conversely, self-supervised visual models like DINO excel at fine-grained visual encoding but lack any integration with language. To bridge this gap, we present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. Our approach aligns the textual embeddings of CLIP to the patch-level features of DINOv2 through a learned mapping function, without fine-tuning the underlying backbones. At training time, we exploit the attention maps of DINOv2 to selectively align local visual patches with textual embeddings. The strong semantic and localization abilities of Talk2DINO yield more natural and less noisy segmentations, and the approach also effectively distinguishes foreground objects from the background. Experimental results show that Talk2DINO achieves state-of-the-art performance across several unsupervised OVS benchmarks.

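To make the training-time idea above concrete, the sketch below shows one way attention-weighted pooling and text alignment could fit together. It is an illustrative approximation, not the repository's actual loss: the tensor names, shapes, and the head-selection step are assumptions, and only the general recipe (pool DINOv2 patches with the backbone's attention, then compare with the projected CLIP text embedding) follows the description above.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch of the attention-guided alignment described above (not the exact training code).
# patch_feats: DINOv2 patch embeddings, shape (batch, num_patches, dim)
# cls_attn:    [CLS]-to-patch attention maps, shape (batch, num_heads, num_patches)
# text_emb:    Talk2DINO-projected CLIP text embeddings, shape (batch, dim), one caption per image
def alignment_scores(patch_feats, cls_attn, text_emb):
    # Normalize each head's attention over the patches so it acts as a pooling weight
    weights = cls_attn / cls_attn.sum(dim=-1, keepdim=True)
    # Attention-weighted pooling: one visual embedding per attention head
    pooled = torch.einsum("bhn,bnd->bhd", weights, patch_feats)
    # Cosine similarity between each head-pooled embedding and the projected text embedding
    sim = F.cosine_similarity(pooled, text_emb.unsqueeze(1), dim=-1)  # (batch, num_heads)
    # Keep the best-matching head, so only the most text-relevant region drives the alignment
    return sim.max(dim=-1).values  # (batch,)
```

Scores of this kind could then feed a standard contrastive (InfoNCE-style) objective over the batch, which is consistent with the `infonce` suffix in the released configuration names (e.g. `vitb_mlp_infonce`).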
## Sample Usage

### Mapping CLIP Text Embeddings to DINOv2 space with Talk2DINO
We can use Talk2DINO to map CLIP text embeddings into the DINOv2 patch embedding space.
```python
import os

import clip
import torch

from src.model import ProjectionLayer

# Device setup
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Configuration and weights
proj_name = 'vitb_mlp_infonce'
config_path = os.path.join("configs", f"{proj_name}.yaml")
weights_path = os.path.join("weights", f"{proj_name}.pth")

# Load the Talk2DINO projection layer
talk2dino = ProjectionLayer.from_config(config_path)
talk2dino.load_state_dict(torch.load(weights_path, map_location=device))
talk2dino.to(device)

# Load the CLIP model
clip_model, clip_preprocess = clip.load("ViT-L/14", device=device, jit=False)
tokenizer = clip.tokenize

# Example: tokenize and project text features
texts = ["a cat"]
text_tokens = tokenizer(texts).to(device)
text_features = clip_model.encode_text(text_tokens)
projected_text_features = talk2dino.project_clip_txt(text_features)
```

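Once the text embeddings live in the DINOv2 space, they can be compared directly with DINOv2 patch features to obtain a coarse segmentation. The snippet below is a minimal, illustrative sketch of that comparison rather than the repository's full pipeline: the `patch_features` tensor, the patch-grid and output sizes, and the bilinear upsampling are assumptions, while `projected_text_features` comes from the example above.

```python
import torch.nn.functional as F

# Illustrative sketch (not the exact inference code of this repository).
# patch_features:          DINOv2 patch embeddings of one image, shape (h * w, dim)
# projected_text_features: Talk2DINO-projected CLIP text embeddings, shape (num_categories, dim)
def coarse_segmentation(patch_features, projected_text_features, grid_hw, out_hw):
    h, w = grid_hw          # patch-grid size
    img_h, img_w = out_hw   # output (image) resolution
    patches = F.normalize(patch_features.float(), dim=-1)
    texts = F.normalize(projected_text_features.float(), dim=-1)
    similarity = patches @ texts.t()                  # (h * w, num_categories)
    score_maps = similarity.t().reshape(-1, 1, h, w)  # (num_categories, 1, h, w)
    # Upsample the patch-level scores and assign each pixel to its best-matching category
    score_maps = F.interpolate(score_maps, size=(img_h, img_w), mode="bilinear", align_corners=False)
    return score_maps.squeeze(1).argmax(dim=0)        # (img_h, img_w) index map
```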
### Demo
In `demo.py` we provide a simple example of how to use Talk2DINO for inference on a given image with custom textual categories. Run:

```bash
python demo.py --input custom_input_image --output custom_output_seg [--with_background] --textual_categories category_1,category_2,..
```

Example:
```bash
python demo.py --input assets/pikachu.png --output pikachu_seg.png --textual_categories pikachu,traffic_sign,forest,route
```

Result:
<div align="center">
<table><tr><td><figure>
<img alt="Input image" src="./assets/pikachu.png" width=300>
</figure></td><td><figure>
<img alt="Segmentation result" src="./pikachu_seg.png" width=300>
</figure></td></tr></table>
</div>

## Installation
```bash
# Create a new environment with Python 3.10
conda create --name talk2dino python=3.10 -c conda-forge
conda activate talk2dino

# Install compilers for C++/CUDA extensions
conda install -c conda-forge "gxx_linux-64=11.*" "gcc_linux-64=11.*"

# Install the CUDA toolkit and cuDNN
conda install -c nvidia/label/cuda-11.7.0 cuda
conda install -c nvidia/label/cuda-11.7.0 cuda-nvcc
conda install -c conda-forge cudnn cudatoolkit=11.7.0

# Install PyTorch 2.1 with CUDA 11.8 support
# Note: this is crucial, as it matches the requirements of mmcv-full 1.7.2
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118

# Install the other dependencies
pip install -r requirements.txt
pip install -U openmim
mim install mmengine

# Install a compatible version of mmcv-full (1.7.2) for PyTorch 2.1
pip install mmcv-full==1.7.2 -f https://download.openmmlab.com/mmcv/dist/cu118/torch2.1.0/index.html

# Install mmsegmentation
pip install mmsegmentation==0.30.0
```

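As an optional sanity check after installation, you can confirm from Python that the expected PyTorch build, its CUDA runtime, and the GPU are all visible; this is only a suggested verification step, not part of the project's scripts.

```python
import torch

# Quick environment check: PyTorch version, the CUDA version it was built with, and GPU visibility
print("torch:", torch.__version__)             # expected: 2.1.0+cu118
print("built with CUDA:", torch.version.cuda)  # expected: 11.8
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```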
<details>
<summary>Qualitative Results</summary>

| **Image** | **Ground Truth** | **FreeDA** | **ProxyCLIP** | **CLIP-DINOiser** | **Ours (Talk2DINO)** |
|-----------|------------------|------------|---------------|-------------------|----------------------|
| ![Image](assets/qualitatives/voc/2_img.jpg) | ![Ground Truth](assets/qualitatives/voc/2_gt.png) | ![FreeDA](assets/qualitatives/voc/2_freeda.png) | ![ProxyCLIP](assets/qualitatives/voc/2_proxy.png) | ![CLIP-DINOiser](assets/qualitatives/voc/2_clipdinoiser.png) | ![Ours](assets/qualitatives/voc/2_talk2dino.png) |
| ![Image](assets/qualitatives/object/2r_img.png) | ![Ground Truth](assets/qualitatives/object/2r_gt.png) | ![FreeDA](assets/qualitatives/object/2r_freeda.png) | ![ProxyCLIP](assets/qualitatives/object/2r_proxy.png) | ![CLIP-DINOiser](assets/qualitatives/object/2r_clipdinoiser.png) | ![Ours](assets/qualitatives/object/2r_talk2dino.png) |
| ![Image](assets/qualitatives/cityscapes/1r_image.png) | ![Ground Truth](assets/qualitatives/cityscapes/1r_gt.png) | ![FreeDA](assets/qualitatives/cityscapes/1r_freeda.png) | ![ProxyCLIP](assets/qualitatives/cityscapes/1r_proxyclip.png) | ![CLIP-DINOiser](assets/qualitatives/cityscapes/1r_clipdinoiser.png) | ![Ours](assets/qualitatives/cityscapes/1r_talk2dino.png) |
| ![Image](assets/qualitatives/context/1r_img.png) | ![Ground Truth](assets/qualitatives/context/1r_gt.png) | ![FreeDA](assets/qualitatives/context/1r_freeda.png) | ![ProxyCLIP](assets/qualitatives/context/1r_proxy.png) | ![CLIP-DINOiser](assets/qualitatives/context/1r_clipdinoiser.png) | ![Ours](assets/qualitatives/context/1r_talk2dino.png) |

</details>

## Reference
If you find this code useful, please cite the following paper:
```
@misc{barsellotti2024talkingdinobridgingselfsupervised,
      title={Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation},
      author={Luca Barsellotti and Lorenzo Bianchi and Nicola Messina and Fabio Carrara and Marcella Cornia and Lorenzo Baraldi and Fabrizio Falchi and Rita Cucchiara},
      year={2024},
      eprint={2411.19331},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.19331},
}
```