SalientDSO: Bringing Attention to Direct Sparse Odometry

Author: Nitin J. Sanket, Yiannis Aloimonos, Huai-Jen Liang, Cornelia Fermüller
Year of publication: 2018
Subject:
DOI: 10.48550/arxiv.1803.00127
Description: Although cluttered indoor scenes contain a lot of useful high-level semantic information that can be used for mapping and localization, most visual odometry (VO) algorithms rely on geometric features such as points, lines, and planes. Lately, driven by this idea, the joint optimization of semantic labels and odometry estimation has gained popularity in the robotics community. This joint optimization is accurate but generally very slow. At the same time, in the vision community, direct and sparse approaches to VO have struck the right balance between speed and accuracy. We merge the successes of these two communities and present a preprocessing method that incorporates semantic information, in the form of visual saliency, into direct sparse odometry (DSO), a highly successful direct sparse VO algorithm. We also present a framework to filter the visual saliency based on scene parsing. Our framework, SalientDSO, relies on widely successful deep-learning-based approaches for visual saliency and scene parsing, which drive the feature selection, yielding highly accurate and robust VO even in the presence of as few as 40 point features per frame. We provide an extensive quantitative evaluation of SalientDSO on the ICL-NUIM and TUM monoVO data sets and show that we outperform DSO and ORB-SLAM, two very popular state-of-the-art approaches in the literature. We also collect and publicly release the CVL-UMD data set, which contains two cluttered indoor sequences on which we show qualitative evaluations. To the best of our knowledge, this is the first paper to use visual saliency and scene parsing to drive the feature selection in direct VO.

Note to Practitioners — The problem of estimating camera motion from a sequence of images captured by a moving camera is commonly called VO. This problem has many applications, such as building a 3-D map of the scene for the robot to navigate, grasp, and so on. Any VO algorithm must be fast, robust, and have low drift (low accumulation of error). These desired properties are generally obtained by selecting "good" features in an image, which, in the computer vision sense, turn out to be "corners." However, when we constrain the setting to an indoor scene with a lot of clutter, we have many objects from which "good" features can be obtained in both a computer-vision sense and a conceptual sense. We use this philosophy and present a preprocessing method that selects better features than a traditional VO pipeline using only geometric features, improving the robustness of the state-of-the-art VO method, direct sparse odometry, and obtaining more accurate and robust results even with fewer features. We evaluate our method on three data sets: ICL-NUIM, TUM monoVO, and CVL-UMD. We collected the custom CVL-UMD data set to demonstrate the robustness of our approach, SalientDSO, in cluttered indoor scenes.
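To make the described pipeline concrete, here is a minimal sketch of the first step, filtering a visual-saliency map with a scene-parsing label map so that saliency on texture-poor surfaces is suppressed. The label IDs, the suppression factor, and the function name are illustrative assumptions, not the paper's exact filtering rule.

```python
import numpy as np

# Hypothetical label IDs for semantic classes that carry little useful
# structure (e.g., wall, floor, ceiling); the actual label set depends on
# the scene-parsing network used.
UNINFORMATIVE_LABELS = {0, 3, 5}

def filter_saliency(saliency, parse_labels, suppress=0.1):
    """Scale down saliency on uninformative semantic classes.

    `saliency` and `parse_labels` are HxW arrays from a saliency network
    and a scene-parsing network, respectively. Saliency on uninformative
    surfaces is attenuated so that downstream feature selection favors
    cluttered, object-rich regions. Illustrative sketch only.
    """
    filtered = saliency.astype(np.float64).copy()
    for label in UNINFORMATIVE_LABELS:
        filtered[parse_labels == label] *= suppress  # damp flat surfaces
    return filtered
```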
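And a companion sketch of saliency-driven feature selection: candidate pixels first pass a gradient-magnitude test (as in DSO's point selection), and the surviving candidates are then sampled with probability proportional to the filtered saliency, down to a small budget such as the 40 points per frame mentioned above. Function names and thresholds are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def select_salient_points(gradient_mag, saliency, num_points=40,
                          grad_thresh=20.0, rng=None):
    """Sample point features, weighting geometric candidates by saliency.

    `gradient_mag` is an HxW image-gradient magnitude map; `saliency` is
    the (filtered) HxW saliency map. Returns an (N, 2) array of (x, y)
    pixel coordinates. Illustrative sketch, not SalientDSO's exact code.
    """
    rng = np.random.default_rng() if rng is None else rng
    ys, xs = np.nonzero(gradient_mag > grad_thresh)   # geometric candidates
    if len(xs) == 0:
        return np.empty((0, 2), dtype=int)
    weights = saliency[ys, xs].astype(np.float64)
    total = weights.sum()
    p = weights / total if total > 0 else None        # uniform fallback
    k = min(num_points, len(xs))
    idx = rng.choice(len(xs), size=k, replace=False, p=p)
    return np.stack([xs[idx], ys[idx]], axis=1)
```

The key design idea this illustrates is that saliency only re-weights which geometrically valid candidates are kept; the direct photometric optimization of DSO itself is left untouched, which is why the method is described as a preprocessing step.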
Database: OpenAIRE