Description: |
Visual scene understanding studies the task of representing a captured scene in a manner that emulates human understanding of that space. Because indoor scenes are designed for human use and are utilised every day, attaining this understanding is crucial for applications such as robotic mapping and navigation, smart home and security systems, and home healthcare and assisted living. However, although humans utilise such spaces in their day-to-day lives, analysis of human activity is not commonly applied to enhance indoor scene-level understanding. The work presented in this thesis therefore investigates the benefits of including human activity information in indoor scene understanding challenges, aiming to demonstrate its potential contributions, applications, and versatility. The first contribution of this thesis utilises human activity to reveal scene regions occluded behind objects and clutter. Human poses recognised from a static sensor are projected into a top-down scene representation that records the belief of human activity over time. This representation is applied to carve a volumetric scene map, initialised from captured depth, to expose the occupancy of hidden scene regions. An object detection approach exploits the revealed occupancy to localise self-, partially-, and, significantly, fully-occluded objects. The second contribution extends the top-down activity representation to predict the functionality of major scene surfaces from human activity recognised in 360-degree video. A convolutional network is trained on simulated human activity to segment walkable, sittable, and interactable surfaces from the top-down perspective. This prediction is applied to construct a complete 3D approximation of the scene, with results showing that scene structure and surface functionality are predicted well from human activity alone. Finally, this thesis investigates an association between the top-down functionality prediction and the captured visual scene. 
A new dataset capturing long-term human activity is introduced to train a model on combined activity and visual scene information. The model is trained to segment functional scene surfaces from the perspective of the capture sensor, with evaluation establishing that the introduction of human activity information can improve functional surface segmentation performance. Overall, the work presented in this thesis demonstrates that analysis of human activity can be applied to enhance indoor scene understanding across various challenges, sensors, and representations. Assorted datasets are introduced alongside the major contributions to motivate further investigation into the application of human activity analysis. |