Depth Any Camera (DAC) is a powerful zero-shot metric depth estimation framework that extends a perspective-trained model to handle any camera type with varying FoVs effectively. Remarkably, DAC can be trained exclusively on perspective images, yet it generalizes seamlessly to fisheye and 360-degree cameras without requiring specialized training data. Key features include:
The zero-shot metric depth estimation results of Depth Any Camera (DAC) are visualized on ScanNet++ fisheye videos and compared to Metric3D-v2. Visualizations of the absolute relative (A.Rel) error against ground truth highlight DAC's superior performance. Additionally, we showcase DAC's application to 360-degree images, where a single forward pass of depth estimation enables full 3D scene reconstruction.
While recent depth foundation models exhibit strong zero-shot generalization, achieving accurate metric depth across diverse camera types—particularly those with large fields of view (FoV) such as fisheye and 360-degree cameras—remains a significant challenge. This paper presents Depth Any Camera (DAC), a powerful zero-shot metric depth estimation framework that extends a perspective-trained model to effectively handle cameras with varying FoVs. The framework is designed to ensure that all existing 3D data can be leveraged, regardless of the specific camera types used in new applications. Remarkably, DAC is trained exclusively on perspective images but generalizes seamlessly to fisheye and 360-degree cameras without the need for specialized training data. DAC employs Equi-Rectangular Projection (ERP) as a unified image representation, enabling consistent processing of images with diverse FoVs. Its key components include a pitch-aware Image-to-ERP conversion for efficient online augmentation in ERP space, a FoV alignment operation to support effective training across a wide range of FoVs, and multi-resolution data augmentation to address resolution disparities between training and testing. DAC achieves state-of-the-art zero-shot metric depth estimation, improving delta-1 accuracy by up to 50% on multiple fisheye and 360-degree datasets compared to prior metric depth foundation models, demonstrating robust generalization across camera types.
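The delta-1 accuracy reported above is the standard metric-depth measure: the fraction of valid pixels whose predicted-to-ground-truth depth ratio (taken in whichever direction is larger) falls below 1.25. A minimal sketch of the metric (the function name and signature are ours, not part of the DAC codebase):

```python
import numpy as np

def delta1_accuracy(pred, gt, mask=None):
    """Fraction of pixels whose predicted/ground-truth depth ratio is within 1.25.

    `pred` and `gt` are metric depth maps of the same shape; `mask` optionally
    selects valid pixels (defaults to gt > 0).
    """
    if mask is None:
        mask = gt > 0
    p, g = pred[mask], gt[mask]
    ratio = np.maximum(p / g, g / p)  # symmetric ratio, always >= 1
    return float(np.mean(ratio < 1.25))
```

A scale error of more than 25% on a pixel (e.g. predicting 2.6 m where the ground truth is 2.0 m) counts against this metric, which is why it is sensitive to the metric-scale distortions that large-FoV cameras introduce.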
DAC is trained on a combination of 3 labeled datasets (670k images) for the indoor model and a combination of 2 labeled datasets (130k images) for the outdoor model. Two 360-degree datasets and two fisheye datasets are used for zero-shot testing.
Depth Any Camera (DAC) significantly outperforms the previous SoTA metric depth estimation models Metric3D-v2 and UniDepth in zero-shot generalization to large-FoV camera images, despite using a significantly smaller training dataset and model size.
The framework of Depth Any Camera is shown below. Our DAC framework converts data from any camera type into a canonical ERP space, allowing a model trained on perspective images to process large-FoV testing data in a consistent space for metric depth inference. During training, an efficient Image-to-ERP conversion enables online data augmentation directly in ERP space, an approach widely proven effective with perspective images. With the proposed FoV-Align process, data with highly varied FoVs is adapted to a single predefined ERP patch size, maximizing training efficiency. During inference, images from any camera type can be converted to ERP space for metric depth estimation, with an optional step to map the ERP output back to the original image space for visualization.
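The core of the Image-to-ERP step is a resampling from the source camera onto a latitude/longitude grid. A simplified sketch for a pinhole source camera, assuming nearest-neighbor sampling and a pitch rotation about the x-axis (the function name, argument names, and conventions are illustrative, not the DAC implementation, which also handles fisheye and 360-degree models):

```python
import numpy as np

def perspective_to_erp(img, fx, fy, cx, cy, erp_h, erp_w,
                       fov_lat, fov_lon, pitch=0.0):
    """Resample a pinhole image onto an ERP patch spanning the given
    latitude/longitude extent (radians), with an optional camera pitch.

    For every ERP pixel: form the spherical ray, apply the pitch rotation,
    project through the pinhole intrinsics, and sample the source image.
    """
    # Latitude/longitude grid of the ERP patch, centered on the optical axis.
    lat = np.linspace(-fov_lat / 2, fov_lat / 2, erp_h)
    lon = np.linspace(-fov_lon / 2, fov_lon / 2, erp_w)
    lon, lat = np.meshgrid(lon, lat)

    # Unit rays on the sphere (x right, y down, z forward).
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)

    # Pitch-aware conversion: rotate the rays about the x-axis.
    cp, sp = np.cos(pitch), np.sin(pitch)
    y, z = cp * y - sp * z, sp * y + cp * z

    # Pinhole projection; rays pointing backward (z <= 0) are invalid.
    valid = z > 1e-6
    z_safe = np.where(valid, z, 1.0)
    u = fx * x / z_safe + cx
    v = fy * y / z_safe + cy

    h, w = img.shape[:2]
    inb = valid & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    ui = np.clip(np.round(u).astype(int), 0, w - 1)
    vi = np.clip(np.round(v).astype(int), 0, h - 1)
    out = img[vi, ui]
    out[~inb] = 0  # ERP pixels outside the source FoV
    return out
```

Because the perspective image covers only part of the sphere, the resulting ERP patch has invalid borders; this is where DAC's FoV alignment matters, fitting each sample's valid region to a common ERP patch size so batches with very different FoVs can be trained together.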
@inproceedings{guo2025depthanycamera,
  title={Depth Any Camera: Zero-Shot Metric Depth Estimation from Any Camera},
  author={Yuliang Guo and Sparsh Garg and S. Mahdi H. Miangoleh and Xinyu Huang and Liu Ren},
  booktitle={arXiv},
  year={2025}
}