Disentangled Generation and Aggregation for Robust Radiance Fields

ECCV 2024

Shihe Shen¹, Huachen Gao¹, Wangze Xu¹, Rui Peng¹,², Luyang Tang¹,², Kaiqiang Xiong¹,², Jianbo Jiao³, Ronggang Wang¹,²

¹School of Electronic and Computer Engineering, Peking University
²Peng Cheng Laboratory
³School of Computer Science, University of Birmingham
* Denotes equal contribution

Paper

Code (Coming Soon)

Supplementary

Abstract

Triplane-based radiance fields have gained attention in recent years because they disentangle 3D scenes effectively, offering a high-quality representation at low computational cost. However, they require precise input camera poses. Because triplane features are updated locally, jointly estimating poses and the triplane in the manner of previous joint pose-NeRF optimization methods easily falls into local minima. To address this, we propose the Disentangled Triplane Generation module, which introduces global feature context and smoothness into triplane learning and thereby mitigates errors caused by local updates. We further propose Disentangled Plane Aggregation (DPA) to mitigate the entanglement that the common triplane feature aggregation introduces during camera pose updates. In addition, we introduce a two-stage warm-start training strategy to reduce the implicit constraints imposed by the triplane generator. Quantitative and qualitative results demonstrate that our method achieves state-of-the-art performance in novel view synthesis with noisy or unknown camera poses, as well as efficient optimization convergence.
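To make the "local update" issue concrete: a triplane lookup reads each plane by bilinear interpolation, so the gradient of a sampled feature with respect to a plane is nonzero only at the four grid cells around the projected point. The NumPy sketch below illustrates this for a hypothetical 8×8 single-channel plane. The closing contrast uses a hypothetical dense linear generator `G` (not the paper's generator) only to show why routing updates through a shared generator spreads them across all of its parameters.

```python
import numpy as np

def bilinear_weights(shape, u, v):
    """Interpolation weight of every cell of an (H, W) plane for a lookup at (u, v).

    u is a continuous column coordinate in [0, W-1], v a row coordinate in [0, H-1].
    A bilinear lookup is linear in the plane values, so these weights are exactly
    the gradient of the sampled value with respect to the plane.
    """
    H, W = shape
    x0 = min(int(np.floor(u)), W - 2)
    y0 = min(int(np.floor(v)), H - 2)
    fx, fy = u - x0, v - y0
    w = np.zeros((H, W))
    w[y0, x0] = (1 - fx) * (1 - fy)
    w[y0, x0 + 1] = fx * (1 - fy)
    w[y0 + 1, x0] = (1 - fx) * fy
    w[y0 + 1, x0 + 1] = fx * fy
    return w

def bilinear_sample(plane, u, v):
    # Weighted sum of the plane's cells; only four weights are nonzero.
    return float(np.sum(plane * bilinear_weights(plane.shape, u, v)))

rng = np.random.default_rng(0)
plane = rng.standard_normal((8, 8))            # hypothetical 8x8 single-channel plane
grad = bilinear_weights(plane.shape, 3.3, 5.6)
print("cells receiving gradient:", np.count_nonzero(grad), "of", grad.size)

# Editing a cell far from the query leaves the sample unchanged, so a gradient
# step driven by this ray cannot correct that cell (the "local update" property).
far = plane.copy()
far[0, 0] += 1.0
print("change in sample:", bilinear_sample(far, 3.3, 5.6) - bilinear_sample(plane, 3.3, 5.6))

# Contrast: if the plane were produced by a hypothetical dense linear generator,
# plane = (G @ z).reshape(8, 8), the chain rule pulls the same gradient back to
# every latent entry: d(sample)/dz = G^T d(sample)/d(plane).
G = rng.standard_normal((64, 16))
grad_z = G.T @ grad.ravel()
print("latent entries receiving gradient:", np.count_nonzero(grad_z), "of", grad_z.size)
```

The first print reports 4 of the 64 cells, while the generic dense `G` spreads the same signal over all 16 latent entries. A learned generator is of course not a single dense matrix; the sketch only shows the mechanism by which a shared generator couples otherwise independent cells.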

Method

Our method consists of two stages. In the first stage, we feed random triplane noise into the proposed triplane generator, which produces the feature grids that represent the scene, while the camera poses are optimized jointly. Features interpolated from the different planes are aggregated by the proposed Disentangled Plane Aggregation (DPA), and an MLP decoder maps the aggregated features to colors and densities. In the second stage, we discard the generator and switch to updating the triplane feature grids directly. We adopt this warm-start training strategy for both efficient training and better scene representation.
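The data flow of the two stages can be outlined in NumPy as a rough sketch. Every component here is a hypothetical stand-in: `ToyGenerator` merely upsamples a coarse noise triplane and mixes channels, aggregation is a plain element-wise sum (one common triplane aggregation, not the paper's DPA), the decoder is a tiny random MLP, and pose optimization and training are omitted. It also assumes that the stage-two grids are initialized from the last generator output, which is one plausible reading of "warm start".

```python
import numpy as np

RES, C = 16, 8                       # hypothetical plane resolution and channel count
rng = np.random.default_rng(0)

def bilinear(plane, u, v):
    """Bilinearly sample a (C, RES, RES) plane at normalized coords u, v in [-1, 1]."""
    x = (u + 1) * 0.5 * (RES - 1)
    y = (v + 1) * 0.5 * (RES - 1)
    x0 = int(np.clip(np.floor(x), 0, RES - 2))
    y0 = int(np.clip(np.floor(y), 0, RES - 2))
    fx, fy = x - x0, y - y0
    return ((1 - fx) * (1 - fy) * plane[:, y0, x0] + fx * (1 - fy) * plane[:, y0, x0 + 1]
            + (1 - fx) * fy * plane[:, y0 + 1, x0] + fx * fy * plane[:, y0 + 1, x0 + 1])

class ToyGenerator:
    """Hypothetical stage-1 generator: upsample a coarse noise triplane, then mix channels.
    Each coarse noise cell feeds a 4x4 block of output cells, a crude stand-in for the
    spatial context a learned generator would provide."""
    def __init__(self):
        self.W = rng.standard_normal((C, C)) / np.sqrt(C)

    def __call__(self, noise):       # noise: (3, C, RES // 4, RES // 4)
        up = noise.repeat(4, axis=2).repeat(4, axis=3)   # nearest upsample to RES x RES
        return np.einsum("dc,pchw->pdhw", self.W, up)    # per-pixel channel mixing

def query(planes, p):
    """Project a 3D point onto the xy, xz, yz planes and aggregate by summation."""
    x, y, z = p
    feats = [bilinear(planes[0], x, y), bilinear(planes[1], x, z), bilinear(planes[2], y, z)]
    return np.sum(feats, axis=0)     # plain sum; NOT the paper's DPA

W1 = rng.standard_normal((C, 32)) / np.sqrt(C)   # tiny random MLP decoder weights
W2 = rng.standard_normal((32, 4)) / np.sqrt(32)

def decode(f):
    h = np.maximum(f @ W1, 0.0)                  # ReLU hidden layer
    out = h @ W2
    rgb = 1.0 / (1.0 + np.exp(-out[:3]))         # sigmoid -> color in (0, 1)
    sigma = np.log1p(np.exp(out[3]))             # softplus -> non-negative density
    return rgb, sigma

# Stage 1: the planes come from the generator applied to fixed random noise.
noise = rng.standard_normal((3, C, RES // 4, RES // 4))
gen = ToyGenerator()
stage1_planes = gen(noise)
point = np.array([0.1, -0.4, 0.7])
rgb1, sigma1 = decode(query(stage1_planes, point))

# Stage 2 (assumed warm start): drop the generator and keep its output as free
# parameters that would from now on be updated directly by gradient descent.
stage2_planes = stage1_planes.copy()
rgb2, sigma2 = decode(query(stage2_planes, point))
print(rgb1, sigma1, np.allclose(rgb1, rgb2))
```

Under this assumption, rendering is unchanged at the moment of the switch (stage two starts exactly where stage one stopped); only the parameterization being optimized changes.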