Search results - "Snoek, P. A."

Report

Beyond Coarse-Grained Matching in Video-Text Retrieval

Authors: Chen, Aozhu; Doughty, Hazel; Li, Xirong; Snoek, Cees G. M.

Video-text retrieval has seen significant advancements, yet the ability of models to discern subtle differences in captions still requires verification. In this paper, we introduce a new approach for fine-grained evaluation. Our approach can be appli

External link: http://arxiv.org/abs/2410.12407

Report

LocoMotion: Learning Motion-Focused Video-Language Representations

Authors: Doughty, Hazel; Thoker, Fida Mohammad; Snoek, Cees G. M.

This paper strives for motion-focused video-language representations. Existing methods to learn video-language representations use spatial-focused data, where identifying the objects and scene is often enough to distinguish the relevant caption. We i

External link: http://arxiv.org/abs/2410.12018

Report

Learning to Ground VLMs without Forgetting

Authors: Bhowmik, Aritra; Derakhshani, Mohammad Mahdi; Koelma, Dennis; Oswald, Martin R.; Asano, Yuki M.; Snoek, Cees G. M.

Spatial awareness is key to enable embodied multimodal AI systems. Yet, without vast amounts of spatial supervision, current Visual Language Models (VLMs) struggle at this task. In this paper, we introduce LynX, a framework that equips pretrained VLM

External link: http://arxiv.org/abs/2410.10491

Report

TULIP: Token-length Upgraded CLIP

Authors: Najdenkoska, Ivona; Derakhshani, Mohammad Mahdi; Asano, Yuki M.; van Noord, Nanne; Worring, Marcel; Snoek, Cees G. M.

We address the challenge of representing long captions in vision-language models, such as CLIP. By design these models are limited by fixed, absolute positional encodings, restricting inputs to a maximum of 77 tokens and hindering performance on task

External link: http://arxiv.org/abs/2410.10034

Report

TVBench: Redesigning Video-Language Evaluation

Authors: Cores, Daniel; Dorkenwald, Michael; Mucientes, Manuel; Snoek, Cees G. M.; Asano, Yuki M.

Large language models have demonstrated impressive performance when integrated with vision models even enabling video understanding. However, evaluating these video models presents its own unique challenges, for which several benchmarks have been pro

External link: http://arxiv.org/abs/2410.07752

Report

SelEx: Self-Expertise in Fine-Grained Generalized Category Discovery

Authors: Rastegar, Sarah; Salehi, Mohammadreza; Asano, Yuki M.; Doughty, Hazel; Snoek, Cees G. M.

In this paper, we address Generalized Category Discovery, aiming to simultaneously uncover novel categories and accurately classify known ones. Traditional methods, which lean heavily on self-supervision and contrastive learning, often fall short whe

External link: http://arxiv.org/abs/2408.14371

Report

Training Language Models on the Knowledge Graph: Insights on Hallucinations and Their Detectability

While many capabilities of language models (LMs) improve with increased training budget, the influence of scale on hallucinations is not yet fully understood. Hallucinations come in many forms, and there is no universally accepted definition. We thus

External link: http://arxiv.org/abs/2408.07852

Report

SIGMA: Sinkhorn-Guided Masked Video Modeling

Authors: Salehi, Mohammadreza; Dorkenwald, Michael; Thoker, Fida Mohammad; Gavves, Efstratios; Snoek, Cees G. M.; Asano, Yuki M.

Video-based pretraining offers immense potential for learning strong visual representations on an unprecedented scale. Recently, masked video modeling methods have shown promising scalability, yet fall short in capturing higher-level semantics due to

External link: http://arxiv.org/abs/2407.15447

Report

GeneralAD: Anomaly Detection Across Domains by Attending to Distorted Features

Authors: Sträter, Luc P. J.; Salehi, Mohammadreza; Gavves, Efstratios; Snoek, Cees G. M.; Asano, Yuki M.

In the domain of anomaly detection, methods often excel in either high-level semantic or low-level industrial benchmarks, rarely achieving cross-domain proficiency. Semantic anomalies are novelties that differ in meaning from the training set, like u

External link: http://arxiv.org/abs/2407.12427

Report

An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

Authors: Nguyen, Duy-Kien; Assran, Mahmoud; Jain, Unnat; Oswald, Martin R.; Snoek, Cees G. M.; Chen, Xinlei

This work does not introduce a new method. Instead, we present an interesting finding that questions the necessity of the inductive bias -- locality in modern computer vision architectures. Concretely, we find that vanilla Transformers can operate by

External link: http://arxiv.org/abs/2406.09415
