Depth Any Camera

Zero-Shot Metric Depth Estimation from Any Camera


Yuliang Guo1*†     Sparsh Garg2†    S. Mahdi H. Miangoleh3    Xinyu Huang1    Liu Ren1
1Bosch Research North America              2Carnegie Mellon University              3Simon Fraser University            
* Corresponding author         † Equal technical contribution

Depth Any Camera (DAC) is a powerful zero-shot metric depth estimation framework that extends a perspective-trained model to effectively handle any type of camera with varying FoVs. Remarkably, DAC can be trained exclusively on perspective images, yet it generalizes seamlessly to fisheye and 360-degree cameras without requiring specialized training data. Key features include:

  • Zero-shot metric depth estimation on fisheye and 360-degree images, significantly outperforming the prior metric depth SoTA models Metric3D-v2 and UniDepth
  • Geometry-focused training framework adaptable to any network architecture and extendable to other 3D perception tasks

Tired of collecting new data and annotations for every new camera type? DAC lets you leverage existing data, ensuring that every piece of previously collected 3D data remains valuable, regardless of the camera type used in a new application.

Demonstrations on Fisheye Videos and 360° Single-View Reconstruction

The zero-shot metric depth estimation results of Depth Any Camera (DAC) are visualized on ScanNet++ fisheye videos and compared to Metric3D-v2. Visualizations of the absolute relative (AbsRel) error against ground truth highlight DAC's superior performance. Additionally, we showcase DAC's application to 360-degree images, where a single forward pass of depth estimation enables full 3D scene reconstruction.
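Because ERP depth is defined per ray on the sphere, a single 360-degree prediction back-projects directly into a point cloud, which is why one forward pass suffices for full scene reconstruction. Below is a minimal NumPy sketch of this back-projection; the axis convention and the assumption that depth encodes Euclidean distance along each ray are ours and may differ from the released code.

```python
# Illustrative sketch (not the DAC API): back-project an ERP depth map
# into a 3D point cloud. Assumes depth = Euclidean distance per ray.
import numpy as np

def erp_depth_to_pointcloud(depth: np.ndarray) -> np.ndarray:
    """Convert an HxW ERP depth map to an Nx3 point cloud."""
    h, w = depth.shape
    # Longitude in [-pi, pi), latitude in (-pi/2, pi/2), one ray per ERP pixel.
    lon = (np.arange(w) + 0.5) / w * 2 * np.pi - np.pi
    lat = np.pi / 2 - (np.arange(h) + 0.5) / h * np.pi
    lon, lat = np.meshgrid(lon, lat)
    # Unit ray directions on the sphere (y-up convention, assumed).
    dirs = np.stack([
        np.cos(lat) * np.sin(lon),   # x: right
        np.sin(lat),                 # y: up
        np.cos(lat) * np.cos(lon),   # z: forward
    ], axis=-1)
    # Scale each unit ray by its depth and flatten to N x 3.
    return (dirs * depth[..., None]).reshape(-1, 3)
```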


Visual comparison to Metric3D-v2 and UniDepth

Abstract

While recent depth foundation models exhibit strong zero-shot generalization, achieving accurate metric depth across diverse camera types—particularly those with large fields of view (FoV) such as fisheye and 360-degree cameras—remains a significant challenge. This paper presents Depth Any Camera (DAC), a powerful zero-shot metric depth estimation framework that extends a perspective-trained model to effectively handle cameras with varying FoVs. The framework is designed to ensure that all existing 3D data can be leveraged, regardless of the specific camera types used in new applications. Remarkably, DAC is trained exclusively on perspective images but generalizes seamlessly to fisheye and 360-degree cameras without the need for specialized training data. DAC employs Equi-Rectangular Projection (ERP) as a unified image representation, enabling consistent processing of images with diverse FoVs. Its key components include a pitch-aware Image-to-ERP conversion for efficient online augmentation in ERP space, a FoV alignment operation to support effective training across a wide range of FoVs, and multi-resolution data augmentation to address resolution disparities between training and testing. DAC achieves state-of-the-art zero-shot metric depth estimation, improving delta-1 accuracy by up to 50% on multiple fisheye and 360-degree datasets compared to prior metric depth foundation models, demonstrating robust generalization across camera types.
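For reference, the delta-1 accuracy reported above is the standard depth metric: the fraction of pixels whose predicted-to-ground-truth depth ratio falls within a 1.25 threshold. A minimal NumPy version (names are illustrative, not from the DAC codebase):

```python
import numpy as np

def delta1_accuracy(pred: np.ndarray, gt: np.ndarray, thresh: float = 1.25) -> float:
    """Fraction of valid pixels where max(pred/gt, gt/pred) < thresh."""
    valid = gt > 0  # ignore pixels without ground-truth depth
    ratio = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return float((ratio < thresh).mean())
```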

Data Coverage

DAC is trained on a combination of three labeled datasets (670k images) for the indoor model and two labeled datasets (130k images) for the outdoor model. Two 360-degree datasets and two fisheye datasets are used for zero-shot testing.

[Figure: training and zero-shot test data coverage]

Zero-shot Metric Depth Estimation

Depth Any Camera (DAC) significantly outperforms the previous SoTA metric depth estimation models Metric3D-v2 and UniDepth in zero-shot generalization to large-FoV camera images, despite using a significantly smaller training dataset and model size.

[Figure: zero-shot metric depth comparison with Metric3D-v2 and UniDepth]

Framework

The framework of Depth Any Camera is shown below. DAC converts data from any camera type into a canonical ERP space, allowing a model trained on perspective images to process large-FoV test data in a consistent space for metric depth inference. During training, an efficient Image-to-ERP conversion enables online data augmentation directly in ERP space, an approach widely proven effective with perspective images. The proposed FoV-Align process adapts data with highly varied FoVs to a single predefined ERP patch size, maximizing training efficiency. During inference, images from any camera type are converted to ERP space for metric depth estimation, with an optional step to map the ERP output back to the original image space for visualization.

[Figure: DAC framework overview]
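To make the Image-to-ERP step concrete, below is a minimal sketch of resampling a pinhole image onto an ERP patch: each ERP pixel's latitude/longitude is turned into a ray, projected through the perspective intrinsics K, and sampled from the source image (nearest-neighbor here for brevity). The pitch-aware rotation and FoV-Align of the actual framework are omitted, and all names are illustrative rather than taken from the DAC codebase.

```python
# Illustrative sketch of perspective-to-ERP resampling (not the DAC API).
import numpy as np

def image_to_erp_patch(img: np.ndarray, K: np.ndarray,
                       erp_h: int, erp_w: int,
                       fov_lat: float, fov_lon: float) -> np.ndarray:
    """Resample a perspective image (H x W x C) onto an ERP patch covering
    the given latitude/longitude extents (radians), centered on the optical axis."""
    h, w = img.shape[:2]
    # Latitude/longitude grid of the ERP patch, centered on (0, 0).
    lat = np.linspace(fov_lat / 2, -fov_lat / 2, erp_h)
    lon = np.linspace(-fov_lon / 2, fov_lon / 2, erp_w)
    lon, lat = np.meshgrid(lon, lat)
    # Ray direction for each ERP pixel (z forward, y down, assumed convention).
    x = np.cos(lat) * np.sin(lon)
    y = -np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    # Project rays into the pinhole image: u = fx * x/z + cx, v = fy * y/z + cy.
    u = K[0, 0] * x / z + K[0, 2]
    v = K[1, 1] * y / z + K[1, 2]
    # Nearest-neighbor sampling with a validity mask (bilinear in practice).
    ui = np.clip(np.round(u).astype(int), 0, w - 1)
    vi = np.clip(np.round(v).astype(int), 0, h - 1)
    patch = img[vi, ui]
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (z > 0)
    patch[~valid] = 0  # zero out ERP pixels whose rays miss the source image
    return patch
```

Resampling every input into this shared ERP canvas is what lets a single perspective-trained model consume perspective, fisheye, and 360-degree images consistently at test time.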

Citation

@inproceedings{guo2025depthanycamera,
  title={Depth Any Camera: Zero-Shot Metric Depth Estimation from Any Camera},
  author={Yuliang Guo and Sparsh Garg and S. Mahdi H. Miangoleh and Xinyu Huang and Liu Ren},
  booktitle={arXiv},
  year={2025}
}