Title:
Dynamic scene modelling and representation from video and depth

Abstract:
Recent advances in sensor technology have introduced low-cost video+depth sensors, such as the Microsoft Kinect, which enable simultaneous acquisition of colour and depth images at video rates. The aim of this research is to investigate representations that support the integration of noisy, partial surface measurements over time to form more complete, temporally coherent models of dynamic scenes with enhanced detail and reduced noise.

The initial focus of this work is the restricted case of rigid geometry, for which online GPU-accelerated volumetric fusion is implemented and tested. An alternative fusion approach based on dense surface elements (surfels) is also explored and compared with the volumetric approach. As a first step towards handling non-rigid scenes, the static volumetric approach is extended to treat articulated (semi-rigid) geometry, with a focus on humans. The human body is segmented into piece-wise rigid volumetric parts, and part tracking is aided by depth-based skeletal motion data.

To address scenes containing more general non-rigid geometry, beyond people and isolated rigid shapes, a more flexible approach is required. A piece-wise modelling method is proposed that uses a sparse surfel graph and alternates repeatedly between part segmentation, motion estimation, and shape estimation; it is designed to incorporate noise reduction and the handling of missing data.

Finally, a hybrid approach is proposed that combines the part segmentation and coarse surface modelling of the surfel graph with the higher-resolution surface reconstruction capability of volumetric fusion. The hybrid method produces a seamless skinned mesh structure that efficiently represents a temporally consistent dynamic scene. The hybrid framework can be considered a unification of rigid and non-rigid reconstruction techniques, with static scenes as a special case. It allows arbitrary dynamic scenes to be represented efficiently, with enhanced detail and completeness where possible, while gracefully falling back to raw measurements where no structure can be inferred. The representation is shown to facilitate creative manipulation of real scene data that would previously have required more complex capture setups or extensive manual processing.
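
For illustration, online volumetric fusion of the kind referred to above is commonly formulated as a truncated signed distance function (TSDF) over a voxel grid, with each incoming depth frame folded in by a per-voxel running weighted average. The following is a minimal CPU sketch of one such update in Python/NumPy; the function name, parameters, and weighting scheme are illustrative assumptions rather than the GPU-accelerated implementation developed in this work.

    import numpy as np

    def fuse_depth_frame(tsdf, weights, depth, K, T_wc,
                         voxel_size=0.01, origin=(0.0, 0.0, 0.0),
                         trunc=0.05, max_weight=100.0):
        # tsdf, weights : (X, Y, Z) arrays holding the volume and per-voxel weights.
        # depth         : (H, W) depth image in metres (0 where there is no measurement).
        # K, T_wc       : 3x3 intrinsics and 4x4 camera-to-world pose (rigid-scene case).
        X, Y, Z = tsdf.shape
        ii, jj, kk = np.meshgrid(np.arange(X), np.arange(Y), np.arange(Z), indexing="ij")
        pts_w = np.stack([ii, jj, kk], -1).reshape(-1, 3) * voxel_size + np.asarray(origin)
        # Transform voxel centres into the camera frame and project with the pinhole model.
        T_cw = np.linalg.inv(T_wc)
        pts_c = pts_w @ T_cw[:3, :3].T + T_cw[:3, 3]
        z = pts_c[:, 2]
        z_safe = np.where(z > 1e-6, z, 1.0)
        u = np.round(K[0, 0] * pts_c[:, 0] / z_safe + K[0, 2]).astype(int)
        v = np.round(K[1, 1] * pts_c[:, 1] / z_safe + K[1, 2]).astype(int)
        H, W = depth.shape
        valid = (z > 1e-6) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        d = np.where(valid, depth[np.clip(v, 0, H - 1), np.clip(u, 0, W - 1)], 0.0)
        # Only update voxels in front of, or just behind, the observed surface.
        valid &= (d > 0) & (d - z > -trunc)
        sdf = (np.clip(d - z, -trunc, trunc) / trunc).reshape(X, Y, Z)
        valid = valid.reshape(X, Y, Z)
        # Per-voxel running weighted average of the truncated signed distance.
        w_new = valid.astype(tsdf.dtype)
        denom = np.maximum(weights + w_new, 1e-6)
        tsdf = np.where(valid, (weights * tsdf + w_new * sdf) / denom, tsdf)
        weights = np.minimum(weights + w_new, max_weight)
        return tsdf, weights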
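
Surfel-based fusion, by contrast, maintains an unstructured set of oriented surface elements (position, normal, radius, and an accumulated confidence) and folds each new measurement into its matched surfel by confidence-weighted averaging. The sketch below shows one plausible form of that merge rule; the field names and the radius heuristic are assumptions for illustration, not necessarily the representation used in the thesis.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Surfel:
        position: np.ndarray    # (3,) point on the surface
        normal: np.ndarray      # (3,) unit normal
        radius: float           # local extent of the element
        confidence: float       # accumulated measurement weight

    def merge(surfel, p, n, r, w=1.0):
        # Fold one new measurement (point p, normal n, radius r, weight w)
        # into an existing surfel by confidence-weighted averaging.
        c = surfel.confidence
        position = (c * surfel.position + w * np.asarray(p)) / (c + w)
        normal = c * surfel.normal + w * np.asarray(n)
        normal = normal / np.linalg.norm(normal)
        radius = min(surfel.radius, r)   # keep the finer of the two size estimates
        return Surfel(position, normal, radius, c + w)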
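
The seamless skinned mesh produced by the hybrid method implies that reconstructed surface detail is deformed by per-part transforms through a skinning model. A minimal linear blend skinning sketch is given below, assuming per-vertex weights over the rigid parts; the specific skinning and blending scheme used in this work may differ.

    import numpy as np

    def skin_vertices(rest_vertices, part_transforms, skin_weights):
        # rest_vertices   : (V, 3) vertex positions in the rest pose
        # part_transforms : (B, 4, 4) rigid transform of each part for the current frame
        # skin_weights    : (V, B) per-vertex blend weights, each row summing to 1
        V = rest_vertices.shape[0]
        homog = np.concatenate([rest_vertices, np.ones((V, 1))], axis=1)   # (V, 4)
        # Apply every part transform to every vertex: (B, V, 4).
        per_part = np.einsum("bij,vj->bvi", part_transforms, homog)
        # Blend the per-part results with the skinning weights: (V, 4).
        blended = np.einsum("vb,bvi->vi", skin_weights, per_part)
        return blended[:, :3]

Weights concentrated on a single part reproduce the piece-wise rigid case, while smoothly varying weights across part boundaries give the seamless deformation described above.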