Description: |
3D hand pose estimation aims to recover the 3D coordinates of hand joints or mesh vertices from visual inputs. It has important applications in human-human and human-machine interaction and in computer animation. Because of large viewpoint changes, cluttered backgrounds, finger similarity, and self-occlusion, accurately estimating 3D hand pose in real time remains very challenging even after decades of research. In this thesis, we propose a series of learning-based methods to 1) improve estimation accuracy, 2) enable training with less or even no human annotation, and 3) apply the estimation system to human-computer interaction with passive haptic feedback. Most of the methods focus on estimating the pose of a single hand from depth data; the only exception is the human-computer interaction application, which uses monochrome inputs. To improve the accuracy of hand pose estimation, the first proposed method incorporates local geometric properties, e.g., surface normals and curvature, into a random forest to achieve better invariance to viewpoint changes. Next, we propose a method that reformulates hand pose estimation as the regression of a dense vector field with a 2D fully convolutional network. Finally, we propose to estimate the 3D coordinates of both joints and mesh vertices by establishing dense correspondences between the input depth map and the template mesh surface with a single forward pass of a highly efficient 2D fully convolutional network. Learning-based hand pose estimation methods, especially deep-learning-based ones, require a large amount of accurate annotation on real samples to achieve high accuracy. However, acquiring such accurately annotated real samples can be extremely difficult and expensive. To reduce the dependency on large amounts of annotated real samples, we propose to leverage unlabelled real samples from two perspectives. First, we use deep generative models to formulate hand pose estimation in a semi-supervised setting.
Second, we bridge model-based and discriminative approaches to enable training the network in a self-supervised way, i.e., by using the model-fitting error to train the neural network. Finally, we demonstrate an interaction application in mixed reality built on a hand pose estimation system. The system runs on a cheaper and more energy-efficient monochrome camera, and its training samples are automatically generated by the accurate depth-based system. All of these methods are thoroughly evaluated and compared with the state of the art on a set of benchmarks. We conclude by discussing their limitations and new ideas for future work. |