Author:
Ran, Yuting, Fang, Bin, Chen, Lei, Wei, Xuekai, Xian, Weizhi, Zhou, Mingliang |
Subject:
|
Source:
Journal of Circuits, Systems & Computers; 3/15/2024, Vol. 33 Issue 4, p1-16, 16p |
Abstract:
In this paper, we propose an end-to-end dual-stream transformer with a parallel encoder (DST-PE) for video captioning, which combines multimodal features and global–local representations to generate coherent captions. First, we design a parallel encoder that includes a local visual encoder and a bridge module, which simultaneously generates refined local and global visual features. Second, we devise a multimodal encoder to enhance the representation ability of our model. Finally, we adopt a transformer decoder that takes the multimodal features as inputs and fuses the local visual features with textual features using a cross-attention block. Extensive experimental results demonstrate that our model achieves state-of-the-art performance with low training costs on several widely used datasets. [ABSTRACT FROM AUTHOR]
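The abstract mentions fusing local visual features with textual features through a cross-attention block. The paper's actual architecture is not given here, but the general mechanism can be illustrated with a minimal single-head cross-attention sketch in NumPy, where textual features form the queries and local visual features supply the keys and values; all shapes, weight matrices, and names below are hypothetical, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_feats, visual_feats, d_k=64, seed=0):
    """Hypothetical single-head cross-attention: text queries attend
    over local visual keys/values. Weights are random stand-ins for
    learned projections."""
    rng = np.random.default_rng(seed)
    d_t = text_feats.shape[-1]
    d_v = visual_feats.shape[-1]
    W_q = rng.standard_normal((d_t, d_k)) / np.sqrt(d_t)
    W_k = rng.standard_normal((d_v, d_k)) / np.sqrt(d_v)
    W_v = rng.standard_normal((d_v, d_k)) / np.sqrt(d_v)
    Q = text_feats @ W_q          # (n_text, d_k)
    K = visual_feats @ W_k        # (n_visual, d_k)
    V = visual_feats @ W_v        # (n_visual, d_k)
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (n_text, n_visual)
    return attn @ V               # one fused vector per text token

# Toy example: 5 textual tokens (512-d), 10 local visual regions (768-d).
text = np.random.default_rng(1).standard_normal((5, 512))
visual = np.random.default_rng(2).standard_normal((10, 768))
fused = cross_attention(text, visual)
print(fused.shape)  # (5, 64)
```

Each output row is a convex combination of projected visual features weighted by how strongly the corresponding text token attends to each region, which is the generic fusion pattern the abstract alludes to.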
Database:
Complementary Index |
External link:
|