Learning monocular 3D reconstruction of articulated categories from motion

Filippos Kokkinos

Iasonas Kokkinos

University College London (UCL)

In CVPR 2021

[Paper]

[Supplementary]

[Code]

[Bibtex]

We tackle the problem of monocular 3D reconstruction for articulated object categories by guiding the deformation of a mesh template (top) through a sparse set of 3D control points regressed by a network from a single image (middle). Despite using only weak supervision in the form of keypoints, masks and video-based correspondence, our approach is able to capture broad articulations, such as opening wings and motion of the lower limbs and neck (bottom).

Monocular 3D reconstruction of articulated object categories is challenging due to the lack of training data and the inherent ill-posedness of the problem. In this work, we use video self-supervision, enforcing the consistency of consecutive 3D reconstructions through a motion-based cycle loss. This substantially improves both optimization-based and learning-based 3D mesh reconstruction. We further introduce an interpretable model of 3D template deformations that controls a 3D surface through the displacement of a small number of local, learnable handles. We formulate this operation as a structured layer relying on mesh-Laplacian regularization and show that it can be trained in an end-to-end manner. Finally, we introduce a per-sample numerical optimization approach that jointly optimizes over mesh displacements and cameras within a video, boosting accuracy both during training and as test-time post-processing. While relying exclusively on a small set of videos collected per category for supervision, we obtain state-of-the-art reconstructions with diverse shapes, viewpoints and textures for multiple articulated object categories.
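To make the idea of driving a surface through a few handles under Laplacian regularization more concrete, here is a minimal sketch in plain Python. It is not the paper's layer: it assumes a uniform graph Laplacian, fixed (not learned) handle vertices, a soft constraint with a hypothetical weight parameter, and a small dense solver. The names graph_laplacian, solve_linear and deform_with_handles are illustrative only. The sketch solves for per-vertex displacements D that minimize ||L D||^2 + w * sum_h ||D_h - d_h||^2, so unconstrained vertices follow the handles smoothly.

```python
def graph_laplacian(n, edges):
    """Uniform graph Laplacian (degree minus adjacency) as a dense n x n list."""
    L = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        L[i][j] -= 1.0
        L[j][i] -= 1.0
        L[i][i] += 1.0
        L[j][j] += 1.0
    return L


def solve_linear(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row_a[:] + row_b[:] for row_a, row_b in zip(A, B)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [x / p for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                f = M[r][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [row[n:] for row in M]


def deform_with_handles(vertices, edges, handles, weight=100.0):
    """Deform a template given target displacements at a few handle vertices.

    handles maps a vertex index to its desired 3D displacement. The weight
    value is an assumption chosen for illustration, not taken from the paper.
    """
    n = len(vertices)
    L = graph_laplacian(n, edges)
    # Normal equations: (L^T L + weight * S^T S) D = weight * S^T d,
    # where S selects the handle vertices.
    A = [[sum(L[k][i] * L[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    B = [[0.0, 0.0, 0.0] for _ in range(n)]
    for h, d in handles.items():
        A[h][h] += weight
        B[h] = [weight * c for c in d]
    D = solve_linear(A, B)
    return [[v + d for v, d in zip(vert, disp)]
            for vert, disp in zip(vertices, D)]


# Toy template: a chain of five vertices along the x axis.
template = [[float(i), 0.0, 0.0] for i in range(5)]
chain_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
# Pin vertex 0 in place and lift vertex 4 by one unit along y.
deformed = deform_with_handles(
    template, chain_edges, {0: (0.0, 0.0, 0.0), 4: (0.0, 1.0, 0.0)}
)
for vertex in deformed:
    print([round(c, 3) for c in vertex])
```

In this toy setting the three interior vertices receive no direct constraint, yet they rise progressively between the pinned end and the lifted end because the Laplacian term penalizes non-smooth displacement fields. In a learned setting, the handle displacements would instead be predicted by a network and the solve would be differentiated through.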

Follow-up work on self-supervised 3D reconstruction can be found at https://fkokkinos.github.io/to_the_point/ .

Comparisons with prior work

Results
