Accurate 3D Pose Estimation From a Single Depth Image

Mao Ye [1], Xianwang Wang [2], Ruigang Yang [1], Liu Ren [3] and Marc Pollefeys [4]

[1] Center for Visualization and Virtual Environments, University of Kentucky

[2] HP Labs, Palo Alto

[3] Bosch Research

[4] ETH Zürich


Figure 1. Examples of estimation results using pose tracking algorithms in [2] ((a) and (c)) and our method ((b) and (d)), from depth images captured by Kinect.



Project Abstract

We present a novel system that estimates body pose configuration from a single depth map by combining pose detection with pose refinement. The input depth map is matched against a set of pre-captured motion exemplars to generate an initial body configuration estimate, as well as a semantic labeling of the input point cloud. The initial estimate is then refined by directly fitting the body configuration to the observation (e.g., the input depth). In addition to the new system architecture, our contributions include a modified point cloud smoothing technique that handles very noisy input depth maps, and a point cloud alignment and pose search algorithm that is view-independent and efficient. Experiments on a public dataset show that our approach achieves significantly higher accuracy than previous state-of-the-art methods.


Algorithm Overview

Given a point cloud, we first remove irrelevant objects based on distance, using two fixed thresholds that bound the distance range of interest throughout our tests. A modified surface reconstruction algorithm is then applied to remove noise. The cleaned point cloud is transformed into a canonical coordinate frame to remove viewpoint dependency, and a similar pose is identified in our motion database. A refined pose configuration is then estimated through non-rigid registration between the input and the depth map rendered for the corresponding pose. We rely on database exemplars and a shape completion method to deal with large occlusions, i.e., missing body parts. Finally, a failure detection and recovery mechanism uses temporal information to handle occasional failures in the previous steps.
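The first steps of this pipeline can be illustrated with a minimal sketch. It is not the implementation used in the paper: the function names, the threshold values, and the toy point cloud are assumptions made for illustration, and centering on the centroid is only a simplified stand-in for the canonical-frame transformation, since it removes translation but not orientation.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in metres, z = distance from the camera


def filter_by_depth(points: List[Point], near: float, far: float) -> List[Point]:
    """Keep only points whose depth lies inside the range [near, far]."""
    return [p for p in points if near <= p[2] <= far]


def center_on_centroid(points: List[Point]) -> List[Point]:
    """Translate the cloud so that its centroid sits at the origin.

    This removes only the translational part of the viewpoint dependency;
    a full canonical frame would also need to fix the orientation.
    """
    if not points:
        raise ValueError("empty point cloud")
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    cz = sum(p[2] for p in points) / n
    return [(p[0] - cx, p[1] - cy, p[2] - cz) for p in points]


# Hypothetical thresholds and toy data, for illustration only.
NEAR_M, FAR_M = 0.5, 3.0
cloud = [(0.0, 0.0, 0.2), (1.0, 0.0, 2.0), (-1.0, 0.0, 2.0), (0.0, 0.0, 5.0)]
person = center_on_centroid(filter_by_depth(cloud, NEAR_M, FAR_M))
```

Using two fixed thresholds keeps this step trivial, at the cost of assuming that the subject stays within a known distance range from the sensor.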


Experimental Results

1. Quantitative comparison with the HC+EP method [1] on the publicly available dataset of [1]

Overall mean error: 38 mm (ours) vs. 100 mm ([1])

2. Qualitative comparison with OpenNI [2]
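For readers who want to reproduce a number like the overall mean error in the quantitative comparison, the sketch below shows one common way to compute it. It assumes the error is the average Euclidean distance between estimated and ground-truth 3D positions, which this page does not spell out. The function name and the toy coordinates are hypothetical and are not taken from the dataset.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in millimetres


def mean_joint_error(estimated: Sequence[Point], ground_truth: Sequence[Point]) -> float:
    """Average Euclidean distance between corresponding 3D positions."""
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(math.dist(e, g) for e, g in zip(estimated, ground_truth))
    return total / len(estimated)


# Made-up positions for illustration only; not results from the paper.
estimated = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
ground_truth = [(30.0, 40.0, 0.0), (0.0, 0.0, 100.0)]
error_mm = mean_joint_error(estimated, ground_truth)
```

An overall figure would then be the average of this quantity over all test frames, again under the assumption stated above.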


Related Publications

Accurate 3D Pose Estimation from a Single Depth Image (pdf, video, poster)

Mao Ye, Xianwang Wang, Ruigang Yang, Liu Ren, Marc Pollefeys

International Conference on Computer Vision, 2011

Clarification: In Table 1 of this paper, the numbers for our method and for [3] ([21] in the paper) were obtained from experiments on different datasets. Our method was tested on the publicly available dataset of [1], while the method of [3] was tested on its authors' synthetic data, which has less noise but a greater variety of poses. The comparison in Table 1 of the paper is therefore not quite appropriate.



[1] V. Ganapathi, C. Plagemann, D. Koller, and S. Thrun. Real time motion capture using a single time-of-flight camera. CVPR 2010.

[2] Primesense. OpenNI.

[3] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from a single depth image. CVPR 2011.



This work is supported in part by the University of Kentucky Research Foundation, US National Science Foundation awards IIS-0448185, CPA-0811647, and MRI0923131, Microsoft’s ETH-EPFL Innovation Cluster for Embedded Software (ICES), and the EC’s FP7 European Research Council grant 4DVIDEO (no. 210806).