MobileDepth: Monocular Depth Estimation Based on Lightweight Vision Transformer

Author:	Yundong Li, Xiaokun Wei
Language:	English
Publication year:	2024
Subject:	Electronic computers. Computer science (QA75.5-76.95); Cybernetics (Q300-390)
Source:	Applied Artificial Intelligence, Vol 38, Iss 1 (2024)
Document type:	article
ISSN:	0883-9514, 1087-6545
DOI:	10.1080/08839514.2024.2364159
Description:	With the rise of deep learning, monocular depth estimation based on convolutional neural networks (CNNs) has made impressive progress. CNNs excel at extracting local features from a single image; however, they cannot model long-range dependencies, which substantially limits the performance of monocular depth estimation. In addition, because CNN-based architectures frequently use downsampling operations, many pixel-level features, which are crucial for dense prediction tasks, are lost in the encoder phase. Unlike CNNs, the Vision Transformer (ViT) can capture global feature information, but it requires a large number of parameters and extensive data augmentation owing to its lack of inductive bias. To address these difficulties, this study proposes a Dilated Self Attention Block (DSAB) and a Local and Global Feature Extraction (LGFE) module. The former addresses the slow inference of standard ViT models by limiting the number of self-attention computations among tokens. The latter combines the advantages of CNNs and ViT: it first extracts local representations in a low-dimensional space through standard convolution and then maps the input tensor to a high-dimensional space to capture global information, so that global and local features are extracted simultaneously.
Database:	Directory of Open Access Journals
External link:	https://doaj.org/article/f60bc44f092e4a1791f3d4bd3f754f50
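
The description states that the DSAB limits the number of self-attention computations among tokens, but this record does not say how. The sketch below illustrates one generic way to do that, and is not the authors' implementation: each token attends only to tokens whose index shares its remainder modulo a dilation factor. The function name `dilated_self_attention`, the grouping rule, and the use of the raw tokens as queries, keys, and values (no learned projections) are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dilated_self_attention(tokens, dilation):
    """Toy dilated self-attention over a 1-D token sequence.

    Assumption (not from the paper): token i attends only to tokens j
    with j % dilation == i % dilation. Tokens serve directly as
    queries, keys, and values; there are no learned projections.

    Returns (outputs, pairs), where pairs counts the query-key dot
    products that were evaluated.
    """
    n = len(tokens)
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    pairs = 0
    for i in range(n):
        # Indices in the same residue class as token i.
        group = range(i % dilation, n, dilation)
        scores = [
            scale * sum(q * k for q, k in zip(tokens[i], tokens[j]))
            for j in group
        ]
        weights = softmax(scores)
        pairs += len(weights)
        # Weighted sum of the attended tokens (used as values).
        outputs.append(
            [sum(w * tokens[j][d] for w, j in zip(weights, group))
             for d in range(dim)]
        )
    return outputs, pairs


if __name__ == "__main__":
    tokens = [[float(i), 1.0] for i in range(8)]
    for r in (1, 2, 4):
        _, pairs = dilated_self_attention(tokens, r)
        print(f"dilation={r}: {pairs} query-key pairs")
```

With 8 tokens, a dilation of 1 reproduces full attention (64 query-key pairs), while dilations of 2 and 4 evaluate 32 and 16 pairs respectively. In this toy setting the cost falls roughly in proportion to the dilation factor, at the price of each token seeing only a strided subset of the sequence.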