Showing 1 - 3 of 3 results for the search: '"Ging, Simon"'
The evaluation of text-generative vision-language models is a challenging yet crucial endeavor. By addressing the limitations of existing Visual Question Answering (VQA) benchmarks and proposing innovative evaluation methodologies, our research seeks …
External link:
http://arxiv.org/abs/2402.07270
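A rough illustration of the idea behind grading open-ended VQA answers against classification labels: a free-form answer can be matched to the ground-truth class or, with partial credit, to a broader ancestor in a semantic hierarchy. The hierarchy, matching rule, and credit weighting below are illustrative assumptions, not the benchmark's actual metric.

```python
# Minimal sketch (not the paper's exact metric): score a free-form VQA answer
# against a classification label and its ancestors in a semantic hierarchy.
# The hierarchy and the partial-credit rule are illustrative assumptions.

def normalize(text: str) -> set[str]:
    """Lowercase and tokenize an answer into a bag of words."""
    return set(text.lower().replace(",", " ").split())

# Hypothetical hierarchy: class -> increasingly general ancestor names.
HIERARCHY = {
    "golden retriever": ["retriever", "dog", "animal"],
}

def score_answer(generated: str, gt_class: str) -> float:
    """1.0 for an exact-class match, partial credit for a coarser but still
    correct ancestor, 0.0 otherwise."""
    answer_tokens = normalize(generated)
    if normalize(gt_class) <= answer_tokens:
        return 1.0
    for depth, ancestor in enumerate(HIERARCHY.get(gt_class, []), start=1):
        if normalize(ancestor) <= answer_tokens:
            return 1.0 / (1 + depth)  # coarser answers get less credit
    return 0.0

print(score_answer("a golden retriever lying on grass", "golden retriever"))  # 1.0
print(score_answer("it looks like a dog", "golden retriever"))                # ~0.33
```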
Vision-language modeling has enabled open-vocabulary tasks where predictions can be queried using any text prompt in a zero-shot manner. Existing open-vocabulary tasks focus on object classes, whereas research on object attributes is limited due to …
External link:
http://arxiv.org/abs/2211.12914
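A minimal sketch of what zero-shot open-vocabulary querying looks like in practice: arbitrary text prompts (here, attribute phrases) are ranked by cosine similarity to an image embedding. The encode_image and encode_text functions are hypothetical stand-ins for a real vision-language encoder, and the random vectors are placeholders only.

```python
# Sketch of zero-shot open-vocabulary querying: rank arbitrary text prompts
# by cosine similarity to an image embedding.
import numpy as np

rng = np.random.default_rng(0)

def encode_image(image) -> np.ndarray:
    return rng.normal(size=512)   # placeholder for a real image encoder

def encode_text(prompt: str) -> np.ndarray:
    return rng.normal(size=512)   # placeholder for a real text encoder

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

attribute_prompts = [
    "a photo of a striped object",
    "a photo of a metallic object",
    "a photo of a wooden object",
]

image_emb = encode_image("example.jpg")
scores = {p: cosine(image_emb, encode_text(p)) for p in attribute_prompts}
print(max(scores, key=scores.get))  # highest-scoring attribute prompt
```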
Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative hierarchical Transformer (COOT) to …
External link:
http://arxiv.org/abs/2011.00597
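A minimal sketch of aligning video and text at several granularities: frame and word features are pooled into clip and sentence features, and again into video and paragraph features, with a similarity scored at each level. This is not the COOT architecture itself; all feature tensors below are random placeholders.

```python
# Sketch of multi-granularity video-text alignment via mean pooling and
# per-level cosine similarity (placeholder features, not the COOT model).
import numpy as np

rng = np.random.default_rng(0)
D = 64

# Hypothetical example: 2 clips x 8 frames of video, 2 sentences x 6 words of text.
frame_feats = rng.normal(size=(2, 8, D))   # per-clip frame features
word_feats = rng.normal(size=(2, 6, D))    # per-sentence word features

clip_feats = frame_feats.mean(axis=1)      # clip level      (2, D)
sent_feats = word_feats.mean(axis=1)       # sentence level  (2, D)
video_feat = clip_feats.mean(axis=0)       # video level     (D,)
para_feat = sent_feats.mean(axis=0)        # paragraph level (D,)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

clip_sim = np.mean([cosine(c, s) for c, s in zip(clip_feats, sent_feats)])
video_sim = cosine(video_feat, para_feat)
alignment = clip_sim + video_sim           # aggregate across granularities
print(f"clip/sentence: {clip_sim:.3f}, video/paragraph: {video_sim:.3f}")
```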