Showing 1 - 1 of 1 for search: '"Goulas, Andreas"'
To address computational and memory limitations of Large Multimodal Models in the Video Question-Answering task, several recent methods extract textual representations per frame (e.g., by captioning) and feed them to a Large Language Model (LLM) that
External link:
http://arxiv.org/abs/2412.17415