Abstract
A physically-grounded visual backbone, Φeat, is introduced to capture material identity through self-supervised training, demonstrating robust features invariant to external physical factors.
Foundation models have emerged as effective backbones for many vision tasks. However, current self-supervised features entangle high-level semantics with low-level physical factors, such as geometry and illumination, hindering their use in tasks requiring explicit physical reasoning. In this paper, we introduce Φeat, a novel physically-grounded visual backbone that encourages a representation sensitive to material identity, including reflectance cues and geometric mesostructure. Our key idea is to employ a pretraining strategy that contrasts spatial crops and physical augmentations of the same material under varying shapes and lighting conditions. While similar data have been used in high-end supervised tasks such as intrinsic decomposition or material estimation, we demonstrate that a pure self-supervised training strategy, without explicit labels, already provides a strong prior for tasks requiring robust features invariant to external physical factors. We evaluate the learned representations through feature similarity analysis and material selection, showing that Φeat captures physically-grounded structure beyond semantic grouping. These findings highlight the promise of unsupervised physical feature learning as a foundation for physics-aware perception in vision and graphics.
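To make the pretraining idea concrete, the following is a minimal sketch of a contrastive (InfoNCE-style) objective in PyTorch, where a positive pair consists of two renderings of the same material under different shape and lighting. This illustrates the general technique only; the function and variable names (`info_nce`, `backbone`) are hypothetical and do not reflect the paper's actual implementation.

```python
# Minimal sketch (not the authors' code): contrastive pretraining where
# features of the same material under different geometry/illumination are
# pulled together and all other batch elements act as negatives.
import torch
import torch.nn.functional as F

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """z_a[i] and z_b[i] encode the same material rendered under different
    shape/lighting; off-diagonal pairs in the batch serve as negatives."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature              # (B, B) cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)

# Hypothetical usage with some vision backbone producing pooled (B, D) features:
# z_a = backbone(renders_lighting_1)   # materials 1..B under one shape/lighting
# z_b = backbone(renders_lighting_2)   # same materials, different shape/lighting
# loss = info_nce(z_a, z_b)
```

The same loss applies symmetrically to spatial crops of a single rendering; the key design choice is that positives vary in extrinsic factors (shape, illumination) while sharing material identity, pushing the representation toward material-specific cues.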
Community
The Librarian Bot found the following similar papers, recommended via the Semantic Scholar API:
- SpatialActor: Exploring Disentangled Spatial Representations for Robust Robotic Manipulation (2025)
- OmniVGGT: Omni-Modality Driven Visual Geometry Grounded Transformer (2025)
- Geometry Meets Vision: Revisiting Pretrained Semantics in Distilled Fields (2025)
- Abstract 3D Perception for Spatial Intelligence in Vision-Language Models (2025)
- Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations (2025)
- Unlocking 3D Affordance Segmentation with 2D Semantic Knowledge (2025)
- DynaRend: Learning 3D Dynamics via Masked Future Rendering for Robotic Manipulation (2025)
