Abstract: |
With the substantial surge in internet video data, the intricate task of video summarization has consistently attracted the computer vision research community. Many recent summarization techniques leverage bidirectional long short-term memory (BiLSTM) for its proficiency in modeling temporal dependencies. However, its effectiveness is limited to short video clips, typically up to 90 to 100 frames. To address this constraint, the proposed approach incorporates global and local multi-head attention, capturing temporal dependencies at both the global and local level. This design also enables parallel computation, improving overall performance on longer videos. This work treats video summarization as a supervised learning task and introduces a deep summarization architecture called multi-head attention with reinforcement learning (MHA-RL). The architecture comprises a pretrained convolutional neural network for extracting features from video frames, together with global and local multi-head attention mechanisms for predicting frame importance scores. Additionally, it integrates a reinforcement-learning-based regressor that accounts for the diversity and representativeness of the generated summary. Extensive experiments are conducted on the benchmark TVSum and SumMe datasets. Both qualitative and quantitative results show that the proposed method outperforms the majority of state-of-the-art summarization techniques.
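The abstract does not give implementation details, so the following is a minimal, hypothetical PyTorch sketch of the frame-scoring idea it describes: global multi-head attention over all frames combined with window-restricted local multi-head attention, followed by a per-frame importance score. The feature dimension (1024), head count, and 30-frame local window are assumptions for illustration, not values from the paper.

```python
# Hypothetical sketch (not the authors' code): global + local multi-head
# attention over pre-extracted frame features. Feature dim, head count,
# and window size are assumed, not taken from the paper.
import torch
import torch.nn as nn

class GlobalLocalMHAScorer(nn.Module):
    def __init__(self, dim=1024, heads=8, window=30):
        super().__init__()
        self.window = window
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, x):                       # x: (1, T, dim) frame features
        g, _ = self.global_attn(x, x, x)        # each frame attends to all frames
        T = x.size(1)
        idx = torch.arange(T)
        # boolean mask: True blocks attention between frames farther
        # apart than `window`, restricting the local branch
        mask = (idx[None, :] - idx[:, None]).abs() > self.window
        l, _ = self.local_attn(x, x, x, attn_mask=mask)
        return self.score(g + l).squeeze(-1)    # (1, T) importance in [0, 1]

feats = torch.randn(1, 240, 1024)               # e.g., 240 sampled frames
scores = GlobalLocalMHAScorer()(feats)
```

Combining the two branches by summation before scoring is one plausible fusion choice; the abstract only states that both attention levels feed the frame importance prediction, and the RL-based diversity/representativeness regressor is omitted here.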