Description: |
Road extraction is a typical semantic segmentation task for remote sensing images, and vision transformers have been among the most efficient techniques for it in recent years. However, roads typically exhibit uneven scales and low signal-to-noise ratios, which can be understood as two asymmetries: one between the road and background categories, and one between the transverse and longitudinal shape of the road. Existing vision transformer models rely on a fixed sliding window mechanism and therefore cannot adapt to the uneven scales of roads. Additionally, self-attention, which is based on fully connected mechanisms over long sequences, may suffer from attention deviation under excessive noise. This makes it ill-suited to the low signal-to-noise ratio of road segmentation and leads to incomplete, fragmented segmentation results. In this paper, we propose a road extraction method based on deformable self-attention computation, termed DOCswin-Trans (Deformable and Overlapped Cross-Window Transformer), to solve these problems. On the one hand, we develop a DOC-Transformer block to address the scale imbalance issue; it uses an overlapped window strategy to preserve as much of the overall contextual semantic information of roads as possible. On the other hand, we propose a deformable window strategy that adaptively resamples input vectors, automatically directing attention to the foreground areas relevant to roads and thereby addressing the low signal-to-noise ratio problem. We evaluate the proposed method on two popular road extraction datasets, DeepGlobe and Massachusetts. The experimental results demonstrate that the proposed method outperforms baseline methods. On the DeepGlobe dataset, the proposed method achieves an IoU improvement ranging from 0.63% to 5.01% compared to baseline methods.
On the Massachusetts dataset, our method achieves an IoU improvement ranging from 0.50% to 6.24% compared to baseline methods. |