UPNeRF: A Unified Framework for Monocular 3D Object Reconstruction and Pose Estimation


Abstract

Monocular 3D reconstruction of categorical objects relies heavily on accurately perceiving each object's pose. While gradient-based optimization within a NeRF framework can update an initially given pose, this paper shows that such a scheme fails when the initial pose deviates even moderately from the true pose. Consequently, existing methods often depend on a third-party 3D object detector to provide an initial object pose, which increases complexity and hurts generalization. To address these challenges, we present UPNeRF, a Unified framework integrating Pose estimation and NeRF-based reconstruction that brings us closer to real-time monocular 3D object reconstruction. UPNeRF decouples the object's dimension estimation from pose refinement to resolve the scale-depth ambiguity, and introduces an effective projected-box representation that generalizes well across different domains. With a dedicated pose estimator that integrates smoothly into an object-centric NeRF, UPNeRF is free from external 3D detectors. UPNeRF achieves state-of-the-art results in both reconstruction and pose estimation on the nuScenes dataset. Furthermore, it exhibits strong cross-dataset generalization on the KITTI and Waymo datasets, surpassing prior methods with up to a 50% reduction in rotation and translation error.
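
The scale-depth ambiguity mentioned above can be illustrated with a minimal sketch (not from the paper): under a pinhole camera with assumed intrinsics, an object whose dimensions and depth are scaled by the same factor projects to exactly the same pixels, which is why dimensions and pose must be estimated separately.

# Minimal sketch of the scale-depth ambiguity under a pinhole camera.
# Intrinsics and box placement are illustrative assumptions.
import numpy as np

def project(points_3d, fx=1000.0, fy=1000.0, cx=640.0, cy=360.0):
    """Project Nx3 camera-frame points with a pinhole model."""
    x, y, z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
    return np.stack([fx * x / z + cx, fy * y / z + cy], axis=1)

# Corners of a unit box placed in front of the camera.
signs = np.array([[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)], float)
box_near = signs * 1.0 + np.array([0.0, 0.0, 10.0])   # size 2 m at depth 10 m
box_far  = signs * 2.0 + np.array([0.0, 0.0, 20.0])   # size 4 m at depth 20 m

print(np.allclose(project(box_near), project(box_far)))  # True: identical projections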


Pipeline

UPNeRF unifies pose estimation and NeRF. The pose estimation module enables UPNeRF to handle objects in diverse poses without external 3D detectors, and the added complexity amounts to only a few MLP layers, as sketched below.
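
The following is a hypothetical PyTorch sketch of how such a lightweight pose head can sit alongside an object-centric NeRF; the module names, feature sizes, and six-dimensional pose update are illustrative assumptions, not the released UPNeRF code.

# Illustrative pose head: a few MLP layers mapping image and projected-box
# embeddings to a pose update that conditions the NeRF branch.
import torch
import torch.nn as nn

class PoseRefinerHead(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 6),  # delta rotation (3) + delta translation (3)
        )

    def forward(self, image_feat, box_feat):
        # Compare the observed image embedding with the projected-box embedding
        # and predict a pose update.
        return self.mlp(torch.cat([image_feat, box_feat], dim=-1))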


Camera-Invariant Pose Estimation Module

The pose estimation module of UPNeRF iteratively updates the object's pose while preserving its scale. It takes the projection of the 3D box corners as a visual representation of the input pose and estimates the pose update by comparing it to the observed image in a latent embedding space. These designs handle the scale-depth ambiguity and make the deep refiner independent of the camera intrinsics, leading to better cross-domain generalization.
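
A minimal sketch of this idea follows, assuming hypothetical helper modules: the current pose hypothesis is rendered as the 2D projection of the 3D box corners, encoded, and compared against the observed crop to predict a pose update, and iterating this converges toward the true pose. The encoders, refiner, and apply_update function are assumed learned or user-supplied components, not the paper's implementation.

# Sketch of the projected-box representation and iterative refinement loop.
import numpy as np

def box_corners(dims):
    """8 corners of a box with size dims = (w, h, l) in the object frame."""
    w, h, l = dims
    signs = np.array([[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)], float)
    return 0.5 * signs * np.array([w, h, l])

def project_box(pose_R, pose_t, dims, K):
    """Project the 3D box corners into the image with intrinsics K."""
    pts_cam = box_corners(dims) @ pose_R.T + pose_t   # object -> camera frame
    uv = pts_cam @ K.T                                # pinhole projection
    return uv[:, :2] / uv[:, 2:3]

def refine_pose(pose_R, pose_t, dims, K, image_crop,
                encode_image, encode_box, refiner, apply_update, n_iters=3):
    """Iteratively refine (R, t); encoders, refiner, and apply_update are assumed modules."""
    for _ in range(n_iters):
        corners_2d = project_box(pose_R, pose_t, dims, K)                 # current pose hypothesis as pixels
        delta = refiner(encode_image(image_crop), encode_box(corners_2d)) # compare in latent space
        pose_R, pose_t = apply_update(pose_R, pose_t, delta)              # hypothetical pose update step
    return pose_R, pose_t

Because the refiner only ever sees the projected corners and the image crop, the camera intrinsics enter the loop solely through the projection step, which is what keeps the learned components camera-invariant.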


Visual Comparison

UPNeRF performs pose estimation reliably, converging quickly from a random initial pose to the true one, and enables neural reconstruction under diverse object poses and occlusions in this cross-dataset setup. Compared visually with the other major competitors, UPNeRF renders sharper images and achieves higher accuracy in both shape and pose.

Citation