SUP-NeRF: A Streamlined Unification of Pose Estimation and NeRF for Monocular 3D Object Reconstruction
ECCV 2024


Abstract

Monocular 3D reconstruction of categorical objects heavily relies on accurately perceiving each object's pose. While gradient-based optimization within a NeRF framework can update an initial pose, this paper highlights that scale-depth ambiguity in monocular object reconstruction causes such optimization to fail when the initial pose deviates even moderately from the true pose. Consequently, existing methods often depend on a third-party 3D object detector to provide the initial pose, leading to increased complexity and generalization issues. To address these challenges, we present SUP-NeRF, a Streamlined Unification of object Pose estimation and NeRF-based object reconstruction. SUP-NeRF decouples the object's dimension estimation from pose refinement to resolve the scale-depth ambiguity, and introduces a camera-invariant projected-box representation that generalizes across different domains. By using a dedicated pose estimator that integrates smoothly into an object-centric NeRF, SUP-NeRF is free from external 3D detectors. SUP-NeRF achieves state-of-the-art results in both reconstruction and pose estimation on the nuScenes dataset. Furthermore, SUP-NeRF exhibits exceptional cross-dataset generalization on the KITTI and Waymo datasets, surpassing prior methods with up to a 50% reduction in rotation and translation error.


Devils in Scale-Depth Ambiguity

Given the input image (left), joint optimization of pose, shape, and texture in NeRF has full freedom to rescale the shape within the normalized shape space (blue box) or to move the 3D box. This phenomenon is visible in the evolution of the rendered objects from iteration 0 (middle) to iteration 50 (right).
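The ambiguity itself follows directly from the pinhole projection model: enlarging an object and pushing it proportionally farther away produces an identical image, so image evidence alone cannot pin down scale and depth jointly. A minimal sketch (with hypothetical intrinsics; not code from the paper) illustrates this:

```python
import numpy as np

# Hypothetical pinhole intrinsics, for illustration only.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

def project(points_cam):
    """Project Nx3 camera-frame points to Nx2 pixel coordinates."""
    uv = (K @ points_cam.T).T
    return uv[:, :2] / uv[:, 2:3]

# A unit cube centered 10 m in front of the camera.
corners = np.array([[x, y, z] for x in (-0.5, 0.5)
                              for y in (-0.5, 0.5)
                              for z in (-0.5, 0.5)])
obj = corners + np.array([0.0, 0.0, 10.0])

# Scale the object's size AND its depth by the same factor s.
s = 2.0
obj_scaled = s * corners + np.array([0.0, 0.0, s * 10.0])

# The two projections coincide: a larger, farther object is
# indistinguishable from a smaller, nearer one in a single image.
print(np.allclose(project(obj), project(obj_scaled)))  # True
```

This is why jointly optimizing shape scale and 3D box depth against a photometric loss is ill-posed, and why SUP-NeRF decouples dimension estimation from pose refinement.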



Pipeline

SUP-NeRF unifies pose estimation and NeRF. The pose estimation module enables SUP-NeRF to handle objects in diverse poses without external 3D detectors, and the added complexity amounts to only a few MLP layers.


Camera-Invariant Pose Estimation Module

The pose estimation module of SUP-NeRF iteratively updates the object's pose while preserving its scale. It takes the projection of the 3D box corners as a visual representation of the input pose and estimates the pose update by comparing this representation to the observed image in a latent embedding space. These designs handle scale-depth ambiguity and make the deep refiner independent of the camera intrinsic parameters, enabling better cross-domain generalization.
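The key idea is that once the intrinsics are folded into the projected corners, the refiner only ever sees 2D pixel coordinates, so it never needs to know the camera parameters. A minimal sketch of forming such a projected-box representation (variable names and the yaw-only rotation are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def box_corners(dims):
    """8 corners of a box with dimensions (l, w, h) in the object frame."""
    l, w, h = dims
    x = np.array([ 1,  1,  1,  1, -1, -1, -1, -1]) * l / 2
    y = np.array([ 1,  1, -1, -1,  1,  1, -1, -1]) * w / 2
    z = np.array([ 1, -1,  1, -1,  1, -1,  1, -1]) * h / 2
    return np.stack([x, y, z], axis=1)

def projected_box(dims, yaw, t, K):
    """Project the posed 3D box into the image.

    The resulting 8x2 corner set encodes the current pose estimate in
    pixel space; a refiner consuming it (rather than raw rotation and
    translation) stays independent of the intrinsics K.
    """
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[ c, 0, s],
                  [ 0, 1, 0],
                  [-s, 0, c]])                 # yaw about the Y axis
    pts = box_corners(dims) @ R.T + t          # object -> camera frame
    uv = (K @ pts.T).T
    return uv[:, :2] / uv[:, 2:3]              # 8x2 pixel coordinates

# Example: a car-sized box 15 m ahead, seen by a hypothetical camera.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
corners_2d = projected_box((4.5, 1.9, 1.6), yaw=0.3,
                           t=np.array([0.0, 1.0, 15.0]), K=K)
print(corners_2d.shape)  # (8, 2)
```

In the full pipeline, this projected box and the image crop are encoded into a shared latent space, and the refiner regresses a pose update that keeps the box dimensions fixed, which is what sidesteps the scale-depth ambiguity.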


Visual Comparison

SUP-NeRF performs pose estimation reliably, converging quickly from a random initial pose to the true one, and enables neural reconstruction under diverse object poses and occlusion cases in this cross-dataset setup. Compared visually to the other major competitors, SUP-NeRF produces sharper rendered images and higher accuracy in both shape and pose.

Citation