Improve model card for ZeroStereo (StereoGen): Add tags, paper info, links, and usage (#1)

Co-authored-by: Niels Rogge <[email protected]>

README.md CHANGED

---
license: mit
pipeline_tag: image-to-image
library_name: diffusers
---

# ZeroStereo: Zero-shot Stereo Matching from Single Images

This repository hosts the **StereoGen** model, a key component of the ZeroStereo framework. ZeroStereo introduces a novel pipeline for zero-shot stereo matching, capable of synthesizing high-quality right images from arbitrary single images. It achieves this by leveraging pseudo disparities generated by a monocular depth estimation model and fine-tuning a diffusion inpainting model to recover missing details while preserving semantic structure.
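
To make the pipeline concrete: the left image is forward-warped with the pseudo disparity to produce an incomplete right view plus an occlusion mask, and the fine-tuned inpainting model then fills the holes. Below is a minimal NumPy sketch of that warping step; the function and variable names are illustrative, and the official repository implements this step differently and far more efficiently.

```python
import numpy as np

def forward_warp_left_to_right(left: np.ndarray, disparity: np.ndarray):
    """Forward-warp a left image into a synthetic right view.

    left:      (H, W, 3) uint8 left image.
    disparity: (H, W) float32 pseudo disparity in pixels (x_left - x_right).
    Returns the warped right view and a mask of holes to be inpainted.
    """
    H, W = disparity.shape
    right = np.zeros_like(left)
    # Depth buffer: nearer pixels (larger disparity) win collisions.
    zbuf = np.full((H, W), -np.inf, dtype=np.float32)
    ys, xs = np.mgrid[0:H, 0:W]
    xr = np.round(xs - disparity).astype(np.int64)  # target column in the right view
    valid = (xr >= 0) & (xr < W)
    for y, x, xt, d in zip(ys[valid], xs[valid], xr[valid], disparity[valid]):
        if d > zbuf[y, xt]:
            zbuf[y, xt] = d
            right[y, xt] = left[y, x]
    hole_mask = ~np.isfinite(zbuf)  # dis-occluded pixels with no source
    return right, hole_mask
```

These holes are exactly what StereoGen's inpainting stage is trained to fill while keeping the scene's semantic structure intact.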

## Paper

The model was presented in the paper [ZeroStereo: Zero-shot Stereo Matching from Single Images](https://huggingface.co/papers/2501.08654).

### Abstract

State-of-the-art supervised stereo matching methods have achieved remarkable performance on various benchmarks. However, their generalization to real-world scenarios remains challenging due to the scarcity of annotated real-world stereo data. In this paper, we propose ZeroStereo, a novel stereo image generation pipeline for zero-shot stereo matching. Our approach synthesizes high-quality right images from arbitrary single images by leveraging pseudo disparities generated by a monocular depth estimation model. Unlike previous methods that address occluded regions by filling missing areas with neighboring pixels or random backgrounds, we fine-tune a diffusion inpainting model to recover missing details while preserving semantic structure. Additionally, we propose Training-Free Confidence Generation, which mitigates the impact of unreliable pseudo labels without additional training, and Adaptive Disparity Selection, which ensures a diverse and realistic disparity distribution while preventing excessive occlusion and foreground distortion. Experiments demonstrate that models trained with our pipeline achieve state-of-the-art zero-shot generalization across multiple datasets with only a dataset volume comparable to Scene Flow.
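
For intuition, Adaptive Disparity Selection can be pictured as rescaling the monocular model's relative inverse depth so that each image's maximum disparity is drawn from a bounded range, which diversifies the disparity distribution without creating excessive occlusion. The sketch below is purely conceptual; the bounds and sampling scheme are assumptions, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_disparity(inv_depth: np.ndarray, d_min: float = 32.0, d_max: float = 128.0) -> np.ndarray:
    """Conceptual Adaptive Disparity Selection.

    inv_depth: (H, W) relative inverse depth from a monocular depth model.
    The [d_min, d_max] bounds are illustrative placeholders.
    """
    # Normalize relative inverse depth to [0, 1].
    norm = (inv_depth - inv_depth.min()) / (inv_depth.max() - inv_depth.min() + 1e-8)
    # Sample a per-image maximum disparity for diversity.
    target_max = rng.uniform(d_min, d_max)
    return norm * target_max
```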

## Code

The official code, along with detailed instructions for fine-tuning, generation, training, and evaluation, can be found in the [GitHub repository](https://github.com/Windsrain/ZeroStereo).

![ZeroStereo overview](https://github.com/Windsrain/ZeroStereo/raw/main/overview.png)

## Pre-Trained Models

The following pre-trained models related to this project are available:

| Model | Link |
| :-: | :-: |
| SDv2I | [Download 🤗](https://huggingface.co/stabilityai/stable-diffusion-2-inpainting/tree/main) |
| StereoGen | [Download 🤗](https://huggingface.co/Windsrain/ZeroStereo/tree/main/StereoGen) |
| Zero-RAFT-Stereo | [Download 🤗](https://huggingface.co/Windsrain/ZeroStereo/tree/main/Zero-RAFT-Stereo) |
| Zero-IGEV-Stereo | [Download 🤗](https://huggingface.co/Windsrain/ZeroStereo/tree/main/Zero-IGEV-Stereo) |
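
Since the StereoGen and stereo-matching checkpoints live in subfolders of the same Hugging Face repository, you can fetch only the weights you need with `huggingface_hub`. A small sketch; the folder patterns are taken from the links above:

```python
from huggingface_hub import snapshot_download

# Download only the StereoGen subfolder of the Windsrain/ZeroStereo repo.
# Swap the pattern for "Zero-RAFT-Stereo/*" or "Zero-IGEV-Stereo/*" as needed.
local_path = snapshot_download(
    repo_id="Windsrain/ZeroStereo",
    allow_patterns=["StereoGen/*"],
    local_dir="checkpoints",
)
print(f"Weights downloaded to: {local_path}")
```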

## Usage

You can load the StereoGen model with the `diffusers` library. Note that full inference involves several pre-processing steps (e.g., monocular depth estimation, warping, and mask generation), so refer to the official GitHub repository for the complete pipeline.

First, ensure you have the `diffusers` library and its dependencies installed:

```bash
pip install diffusers transformers accelerate torch
```

Here is a basic example that loads the pipeline:

```python
import torch
from diffusers import DiffusionPipeline
from PIL import Image

# Load the StereoGen pipeline.
# This model synthesizes a right stereo image from a single left input image.
pipeline = DiffusionPipeline.from_pretrained("Windsrain/ZeroStereo", torch_dtype=torch.float16)

# Move the pipeline to the GPU if one is available.
if torch.cuda.is_available():
    pipeline.to("cuda")

# Placeholder input image; replace it with your actual left image.
# For full usage (e.g., generating the required depth maps and masks),
# please refer to the project's GitHub repository.
# input_image = Image.open("path/to/your/left_image.png").convert("RGB")
input_image = Image.new("RGB", (512, 512), color="blue")  # dummy image for demonstration

print("Model loaded successfully. For detailed inference and generation scripts,")
print("refer to the official GitHub repository: https://github.com/Windsrain/ZeroStereo")

# The actual inference call may require specific inputs
# (e.g., `image`, `depth_map`, `mask_image`) depending on the pipeline's
# implementation; see the project's GitHub demo/generation scripts.
# Example inference might look like:
# generated_right_image = pipeline(image=input_image, depth_map=some_depth, mask_image=some_mask).images[0]
# generated_right_image.save("generated_stereo_right.png")
```
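
Because StereoGen is fine-tuned from Stable Diffusion 2 inpainting, the final synthesis step can be pictured through the standard inpainting interface: the warped right view goes in as `image` and the dis-occlusion holes as `mask_image`. The sketch below uses the base SDv2-inpainting checkpoint and illustrative file names; whether the StereoGen weights load directly through `StableDiffusionInpaintPipeline` is an assumption, so check the repository's generation scripts for the exact entry point.

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

# Base SDv2-inpainting pipeline; StereoGen fine-tunes this model.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
)
if torch.cuda.is_available():
    pipe = pipe.to("cuda")

# Illustrative inputs: the forward-warped right view and its hole mask
# (white = pixels to inpaint), produced by the warping step described above.
warped_right = Image.open("warped_right.png").convert("RGB")
hole_mask = Image.open("hole_mask.png").convert("L")

# An empty prompt keeps the inpainting conditioned only on the surrounding image.
right_image = pipe(prompt="", image=warped_right, mask_image=hole_mask).images[0]
right_image.save("generated_right.png")
```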

## Acknowledgement

This project is based on [MfS-Stereo](https://github.com/nianticlabs/stereo-from-mono), [Depth Anything V2](https://github.com/DepthAnything/Depth-Anything-V2), [Marigold](https://github.com/prs-eth/Marigold), [RAFT-Stereo](https://github.com/princeton-vl/RAFT-Stereo), and [IGEV-Stereo](https://github.com/gangweix/IGEV). We thank the original authors for their excellent work.

## Citation

If you find this work helpful, please cite the paper:

```bibtex
@article{wang2025zerostereo,
  title={ZeroStereo: Zero-shot Stereo Matching from Single Images},
  author={Wang, Xianqi and Yang, Hao and Xu, Gangwei and Cheng, Junda and Lin, Min and Deng, Yong and Zang, Jinliang and Chen, Yurui and Yang, Xin},
  journal={arXiv preprint arXiv:2501.08654},
  year={2025}
}
```