Movie Visual and Speech Analysis Through Multi-Modal LLM for Recommendation Systems

Author: Peixuan Qi
Language: English
Publication year: 2024
Source: IEEE Access, Vol. 12, pp. 145686-145702 (2024)
Document type: article
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2024.3471568
Description: Understanding speech as a component of broader video comprehension within audio-visual large language models remains a critical yet underexplored area. Previous research has predominantly tackled this challenge by adapting models developed for conventional video classification tasks, such as action recognition or event detection. However, these models often overlook the linguistic elements present in videos, such as narration or dialogue, which can implicitly convey high-level semantic information relevant to movie understanding, including narrative structure and contextual background. Moreover, existing methods are generally configured to encode the entire video content, which can lead to inefficiencies in genre classification tasks. In this paper, we propose a multi-modal Large Language Model (LLM) framework, termed Visual-Speech Multimodal LLM (VSM-LLM), that analyzes movie visual and speech data to predict movie genres. The model incorporates an advanced MGC Q-Former architecture, enabling fine-grained temporal alignment of audio-visual features across multiple time scales. On the MovieNet dataset, VSM-LLM attains 40.3% macro and 55.3% micro recall@0.5, outperforming existing baselines. On the Condensed Movies dataset, VSM-LLM achieves 43.5% macro and 53.5% micro recall@0.5, further confirming its superior genre classification performance.
Database: Directory of Open Access Journals
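
The record reports macro and micro recall@0.5 but does not define the metric. A common reading for multi-label genre classification is per-genre recall after thresholding predicted genre scores at 0.5; the minimal NumPy sketch below follows that reading, and the helper name recall_at_threshold is a hypothetical label, not anything named in the paper.

```python
import numpy as np

def recall_at_threshold(y_true, y_score, threshold=0.5):
    """Macro- and micro-averaged recall at a score threshold.

    y_true:  (n_samples, n_genres) binary ground-truth matrix
    y_score: (n_samples, n_genres) predicted genre scores in [0, 1]
    Returns (macro_recall, micro_recall).
    """
    y_pred = (y_score >= threshold).astype(int)
    tp = ((y_pred == 1) & (y_true == 1)).sum(axis=0)  # per-genre true positives
    pos = (y_true == 1).sum(axis=0)                   # per-genre ground-truth positives

    # Macro: mean of per-genre recalls (genres with no positives are skipped)
    valid = pos > 0
    macro = (tp[valid] / pos[valid]).mean()

    # Micro: pool counts across all genres before dividing
    micro = tp.sum() / pos.sum()
    return macro, micro

# Toy example: 4 movies, 3 genres
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_score = np.array([[0.9, 0.2, 0.4], [0.1, 0.8, 0.3],
                    [0.7, 0.6, 0.1], [0.2, 0.1, 0.9]])
print(recall_at_threshold(y_true, y_score))  # macro and micro recall@0.5
```

The MGC Q-Former itself is likewise not detailed in this record. As a rough illustration of the general mechanism it builds on, a generic Q-Former-style block compresses a variable-length audio-visual feature sequence into a small, fixed set of tokens by letting learned query tokens cross-attend to the features. The PyTorch sketch below shows only that generic mechanism; the class name QueryingBlock and all dimensions are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class QueryingBlock(nn.Module):
    """Generic Q-Former-style block: learned query tokens cross-attend to
    audio-visual features, compressing a variable-length sequence into a
    small, LLM-ready token set. A sketch, not the paper's MGC Q-Former."""

    def __init__(self, dim=768, num_queries=32, num_heads=12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm_q = nn.LayerNorm(dim)
        self.norm_ffn = nn.LayerNorm(dim)

    def forward(self, av_feats):
        # av_feats: (batch, seq_len, dim) fused audio-visual features
        q = self.queries.expand(av_feats.size(0), -1, -1)
        # Queries attend to the full feature sequence (pre-norm residual)
        attn_out, _ = self.cross_attn(self.norm_q(q), av_feats, av_feats)
        q = q + attn_out
        q = q + self.ffn(self.norm_ffn(q))
        return q  # (batch, num_queries, dim) tokens handed to the LLM

# Example: 8 clips, 40 timesteps of fused audio-visual features
feats = torch.randn(8, 40, 768)
tokens = QueryingBlock()(feats)  # -> (8, 32, 768)
```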