Self-supervised learning of structural representations of visual objects

Author:	Jakab, T
Contributors:	Vedaldi, A
Language:	English
Year of publication:	2022
Subject:	Computer vision
Description:	This thesis explores how a computer can learn the structure of visual objects in the absence of strong supervision using self-supervised learning. We demonstrate that we can learn structural representations of objects using an autoencoding framework with reconstruction as the key learning signal. We do this by engineering bottlenecks that disentangle object structure from other factors of variation. Moreover, we design the bottlenecks to represent the object structure in the form of 2D and 3D object landmarks or a 3D mesh. Specifically, we develop a method that automatically discovers 2D object landmarks without any annotations using a conditional autoencoder with a 2D keypoint bottleneck that disentangles pose, represented as 2D keypoints, from appearance. Although self-supervised learning methods can learn stable object landmarks, the automatically discovered landmarks are not aligned with the landmarks that human annotators would mark. To address this, we present a method that can inject an unpaired empirical prior into a conditional autoencoder by introducing a novel landmark autoencoding that can leverage powerful image discriminators used in adversarial learning. A by-product of these conditional autoencoding methods is that the generation can be interactively controlled by manipulating the keypoints in the bottleneck. We leverage this feature in a novel method for interactive 3D shape deformation. The method is trained in a self-supervised way to use automatically discovered 3D landmarks to align pairs of 3D shapes. At test time, the method allows the user to interactively deform the object shape via the discovered 3D object landmarks. Finally, we present a method that uses a photo-geometric autoencoder to recover the 3D shape of an object category without any 3D annotations. It uses videos for training and learns to disentangle an image input into a rigid pose, texture, and deformable shape model.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=od______1064::9bb8b25d707cd22843c328572fbffc05 https://ora.ox.ac.uk/objects/uuid:422cfd39-34ac-4aa2-978f-b52e47010f0f