Proxy-Cap Project Page

IEEE CVPR 2024

Real-time Monocular Full-body Capture in World Space via Sequential
Proxy-to-Motion Learning

Yuxiang Zhang¹, Hongwen Zhang^2, Liangxiao Hu³, Jiajun Zhang⁴, Hongwei Yi⁵, Shengping Zhang³, Yebin Liu^1 (* - corresponding author)

¹Tsinghua University
²Beijing Normal University
³Harbin Institute of Technology
⁴Beijing University of Posts and Telecommunications
⁵Max Planck Institute for Intelligent Systems, T̈ubingen, Germany

Fig 1. The proposed method, ProxyCap, is a real-time monocular full-body capture solution to produce accurate human motions with plausible foot-ground contact in world space.

Abstract

Learning-based approaches to monocular motion cap- ture have recently shown promising results by learning to regress in a data-driven manner. However, due to the chal- lenges in data collection and network designs, it remains challenging for existing solutions to achieve real-time full- body capture while being accurate in world space. In this work, we introduce ProxyCap, a human-centric proxy-to- motion learning scheme to learn world-space motions from a proxy dataset of 2D skeleton sequences and 3D rotational motions. Such proxy data enables us to build a learning- based network with accurate world-space supervision while also mitigating the generalization issues. For more accu- rate and physically plausible predictions in world space, our network is designed to learn human motions from a human-centric perspective, which enables the understand- ing of the same motion captured with different camera tra- jectories. Moreover, a contact-aware neural motion descent module is proposed in our network so that it can be aware of foot-ground contact and motion misalignment with the proxy observations. With the proposed learning-based solu- tion, we demonstrate the first real-time monocular full-body capture system with plausible foot-ground contact in world space even using hand-held moving cameras.

Overview

Fig 2. Illustration of the proposed method ProxyCap. Our method takes the estimated 2D skeletons from a sliding window as inputs and estimates the relative 3D motions in the human coordinate space. These local movements are accumulated frame by frame to recover the global 3D motions. For more accurate and physically plausible results, a contact-aware neural motion descent module is proposed to refine the initial motion predictions.

Results

Fig 3. Results across different cases in the (a,g) 3DPW [61], (b) EHF [40], and (c) Human3.6M [13] datasets and (d,e,f) internet videos. We demonstrate that our method can recover the accurate and plausible human motions in moving cameras at a real-time performance. Specifically, (g) demonstrates the robustness and the temporal coherence of our method even under the occlusion inputs.

Fig 4. Qualitative comparison with previous state-of-the-art methods: (a) PyMAF-X [72], (c) GLAMR [67], (b)(d) Ours.

Technical Paper

Demo Video

Citation

Yuxiang Zhang and Hongwen Zhang and Liangxiao Hu and Jiajun Zhang and Hongwei Yi and Shengping Zhang and Yebin Liu. "ProxyCap: Real-time Monocular Full-body Capture in World Space via Human-Centric Proxy-to-Motion Learning"

@misc{zhang2023proxycap,
title={ProxyCap: Real-time Monocular Full-body Capture in World Space via Human-Centric Proxy-to-Motion Learning},
author={Yuxiang Zhang and Hongwen Zhang and Liangxiao Hu and Jiajun Zhang and Hongwei Yi and Shengping Zhang and Yebin Liu},
year={2023},
booktitle = {IEEE International Conference on Computer Vision and Pattern Recognition, (CVPR)},
}