TransFuseGrid: Transformer-based Lidar-RGB fusion for semantic grid prediction

Author: Gustavo Salazar-Gomez, David Sierra-Gonzalez, Manuel Diaz-Zapata, Anshul Paigwar, Wenqian Liu, Ozgur Erkent, Christian Laugier
Contributors: Robots coopératifs et adaptés à la présence humaine en environnements (CHROMA), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria), CITI Centre of Innovation in Telecommunications and Integration of services (CITI), Institut National des Sciences Appliquées de Lyon (INSA Lyon), Université de Lyon, Institut National des Sciences Appliquées (INSA), Inria Lyon, Université Grenoble Alpes (UGA), Hacettepe University (Hacettepe Üniversitesi)
Language: English
Publication year: 2022
Subject:
Source: ICARCV 2022 - 17th International Conference on Control, Automation, Robotics and Vision, Dec 2022, Singapore. pp. 1-6
Description: Semantic grids are a succinct and convenient way to represent the environment for mobile robotics and autonomous driving applications. While Lidar sensors are now widespread in robotics, most semantic grid prediction approaches in the literature rely only on RGB data. In this paper, we present an approach for semantic grid prediction that uses a transformer architecture to fuse Lidar sensor data with RGB images from multiple cameras. Our proposed method, TransFuseGrid, first transforms both input streams into top-view embeddings, then fuses these embeddings at multiple scales with Transformers. Finally, a decoder transforms the fused top-view feature map into a semantic grid of the vehicle's environment. We evaluate the performance of our approach on the nuScenes dataset for the vehicle, drivable area, lane divider and walkway segmentation tasks. The results show that TransFuseGrid outperforms competing RGB-only and Lidar-only methods. Additionally, the Transformer feature fusion yields a significant improvement over naive RGB-Lidar concatenation. In particular, for the segmentation of vehicles, our model outperforms state-of-the-art RGB-only and Lidar-only methods by 24% and 53%, respectively.
Database: OpenAIRE
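
To make the fusion idea in the description above concrete, here is a minimal, illustrative sketch of transformer-based fusion of RGB and Lidar top-view feature maps. It is not the authors' implementation: it assumes both streams have already been projected to a common bird's-eye-view grid, fuses at a single scale rather than multiple scales, and all module names, channel sizes, and class counts are invented for illustration.

```python
# Illustrative sketch only (not the TransFuseGrid code): fuse two top-view
# feature maps with a Transformer encoder, then decode a semantic grid.
import torch
import torch.nn as nn


class TopViewTransformerFusion(nn.Module):
    def __init__(self, channels: int = 64, num_heads: int = 4, num_layers: int = 2):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True
        )
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Simple decoder head producing per-cell class logits for the semantic grid.
        self.decoder = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 4, kernel_size=1),  # e.g. vehicle, drivable area, lane divider, walkway
        )

    def forward(self, rgb_bev: torch.Tensor, lidar_bev: torch.Tensor) -> torch.Tensor:
        # rgb_bev, lidar_bev: (B, C, H, W) top-view embeddings from each sensor stream.
        b, c, h, w = rgb_bev.shape
        # Flatten both grids into token sequences and concatenate them so that
        # self-attention can mix information across modalities and grid locations.
        rgb_tokens = rgb_bev.flatten(2).transpose(1, 2)      # (B, H*W, C)
        lidar_tokens = lidar_bev.flatten(2).transpose(1, 2)  # (B, H*W, C)
        tokens = torch.cat([rgb_tokens, lidar_tokens], dim=1)  # (B, 2*H*W, C)
        fused = self.fusion(tokens)
        # Keep the first H*W tokens as the fused top-view map and reshape back.
        fused_map = fused[:, : h * w].transpose(1, 2).reshape(b, c, h, w)
        return self.decoder(fused_map)  # (B, num_classes, H, W) semantic grid logits


if __name__ == "__main__":
    rgb_bev = torch.randn(1, 64, 50, 50)    # hypothetical RGB top-view embedding
    lidar_bev = torch.randn(1, 64, 50, 50)  # hypothetical Lidar top-view embedding
    grid_logits = TopViewTransformerFusion()(rgb_bev, lidar_bev)
    print(grid_logits.shape)  # torch.Size([1, 4, 50, 50])
```

The key design point the sketch tries to capture is that attention-based fusion lets every grid cell attend to both modalities at every location, which the abstract contrasts with naive RGB-Lidar concatenation; the paper applies this fusion at multiple scales before decoding.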