Depth Any Camera (DAC) is a powerful zero-shot metric depth estimation framework that extends a perspective-trained model to handle any camera type with varying FoVs effectively. Remarkably, DAC can be trained exclusively on perspective images, yet it generalizes seamlessly to fisheye and 360-degree cameras without requiring specialized training data. Key features include:
The zero-shot metric depth estimation results of Depth Any Camera (DAC) are visualized on ScanNet++ fisheye videos and compared to Metric3D-v2. Visualizations of the absolute relative (A.Rel) error against ground truth highlight DAC's superior performance. Additionally, we showcase DAC's application to 360-degree images, where a single forward pass of depth estimation enables full 3D scene reconstruction.
While recent depth foundation models exhibit strong zero-shot generalization, achieving accurate metric depth across diverse camera types—particularly those with large fields of view (FoV) such as fisheye and 360-degree cameras—remains a significant challenge. This paper presents Depth Any Camera (DAC), a powerful zero-shot metric depth estimation framework that extends a perspective-trained model to effectively handle cameras with varying FoVs. The framework is designed to ensure that all existing 3D data can be leveraged, regardless of the specific camera types used in new applications. Remarkably, DAC is trained exclusively on perspective images but generalizes seamlessly to fisheye and 360-degree cameras without the need for specialized training data. DAC employs Equi-Rectangular Projection (ERP) as a unified image representation, enabling consistent processing of images with diverse FoVs. Its key components include a pitch-aware Image-to-ERP conversion for efficient online augmentation in ERP space, a FoV alignment operation to support effective training across a wide range of FoVs, and multi-resolution data augmentation to address resolution disparities between training and testing. DAC achieves state-of-the-art zero-shot metric depth estimation, improving delta-1 accuracy by up to 50% on multiple fisheye and 360-degree datasets compared to prior metric depth foundation models, demonstrating robust generalization across camera types.
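The delta-1 accuracy reported above is the standard metric-depth measure: the fraction of valid pixels whose predicted-to-ground-truth depth ratio (taken in whichever direction is larger) falls below 1.25. A minimal sketch of the metric (the function name and signature are ours, not part of the DAC codebase):

```python
import numpy as np

def delta1_accuracy(pred, gt, mask=None):
    """Fraction of pixels whose predicted/ground-truth depth ratio is within 1.25.

    `pred` and `gt` are metric depth maps of the same shape; `mask` optionally
    selects valid pixels (defaults to gt > 0).
    """
    if mask is None:
        mask = gt > 0
    p, g = pred[mask], gt[mask]
    ratio = np.maximum(p / g, g / p)  # symmetric ratio, always >= 1
    return float(np.mean(ratio < 1.25))
```

A scale error of more than 25% on a pixel (e.g. predicting 2.6 m where the ground truth is 2.0 m) counts against this metric, which is why it is sensitive to the metric-scale distortions that large-FoV cameras introduce.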
DAC is trained on a combination of 3 labeled datasets (670k images) for the indoor model and a combination of 2 labeled datasets (130k images) for the outdoor model. Two 360-degree datasets and two fisheye datasets are used for zero-shot testing.
Depth Any Camera (DAC) significantly outperforms the previous SoTA metric depth estimation models Metric3D-v2 and UniDepth in zero-shot generalization to large-FoV camera images, despite using a significantly smaller training dataset and model size.
The framework of Depth Any Camera is shown below. Our DAC framework converts data from any camera type into a canonical ERP space, allowing a model trained on perspective images to process large-FoV testing data in a consistent space for metric depth inference. During training, an efficient Image-to-ERP conversion enables online data augmentation directly in ERP space, an approach widely proven effective with perspective images. With the proposed FoV-Align process, data with highly varied FoVs is adapted to a single predefined ERP patch size, maximizing training efficiency. During inference, images from any camera type can be converted to ERP space for metric depth estimation, with an optional step to map the ERP output back to the original image space for visualization.
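The core of the Image-to-ERP step is a resampling from the source camera onto a latitude/longitude grid. A simplified sketch for a pinhole source camera, assuming nearest-neighbor sampling and a pitch rotation about the x-axis (the function name, argument names, and conventions are illustrative, not the DAC implementation, which also handles fisheye and 360-degree models):

```python
import numpy as np

def perspective_to_erp(img, fx, fy, cx, cy, erp_h, erp_w,
                       fov_lat, fov_lon, pitch=0.0):
    """Resample a pinhole image onto an ERP patch spanning the given
    latitude/longitude extent (radians), with an optional camera pitch.

    For every ERP pixel: form the spherical ray, apply the pitch rotation,
    project through the pinhole intrinsics, and sample the source image.
    """
    # Latitude/longitude grid of the ERP patch, centered on the optical axis.
    lat = np.linspace(-fov_lat / 2, fov_lat / 2, erp_h)
    lon = np.linspace(-fov_lon / 2, fov_lon / 2, erp_w)
    lon, lat = np.meshgrid(lon, lat)

    # Unit rays on the sphere (x right, y down, z forward).
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)

    # Pitch-aware conversion: rotate the rays about the x-axis.
    cp, sp = np.cos(pitch), np.sin(pitch)
    y, z = cp * y - sp * z, sp * y + cp * z

    # Pinhole projection; rays pointing backward (z <= 0) are invalid.
    valid = z > 1e-6
    z_safe = np.where(valid, z, 1.0)
    u = fx * x / z_safe + cx
    v = fy * y / z_safe + cy

    h, w = img.shape[:2]
    inb = valid & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    ui = np.clip(np.round(u).astype(int), 0, w - 1)
    vi = np.clip(np.round(v).astype(int), 0, h - 1)
    out = img[vi, ui]
    out[~inb] = 0  # ERP pixels outside the source FoV
    return out
```

Because the perspective image covers only part of the sphere, the resulting ERP patch has invalid borders; this is where DAC's FoV alignment matters, fitting each sample's valid region to a common ERP patch size so batches with very different FoVs can be trained together.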
@inproceedings{guo2025depthanycamera,
  title={Depth Any Camera: Zero-Shot Metric Depth Estimation from Any Camera},
  author={Yuliang Guo and Sparsh Garg and S. Mahdi H. Miangoleh and Xinyu Huang and Liu Ren},
  booktitle={arXiv},
  year={2025}
}