Showing 1 - 10 of 134 for search: '"Seo, Paul"'
Author:
Lim, Sangbeom, Kim, Seongchan, An, Seungjun, Cho, Seokju, Seo, Paul Hongsuck, Kim, Seungryong
Current benchmarks for video segmentation are limited to annotating only salient objects (i.e., foreground instances). Despite their impressive architectural designs, previous works trained on these benchmarks have struggled to adapt to real-world …
External link:
http://arxiv.org/abs/2412.01471
Author:
Shin, Heeseong, Kim, Chaehyun, Hong, Sunghwan, Cho, Seokju, Arnab, Anurag, Seo, Paul Hongsuck, Kim, Seungryong
Large-scale vision-language models like CLIP have demonstrated impressive open-vocabulary capabilities for image-level tasks, excelling in recognizing what objects are present. However, they struggle with pixel-level recognition tasks like semantic segmentation …
External link:
http://arxiv.org/abs/2409.19846
We propose a new framework that automatically generates high-quality segmentation masks with their referring expressions as pseudo supervisions for referring image segmentation (RIS). These pseudo supervisions allow the training of any supervised RIS …
External link:
http://arxiv.org/abs/2407.07412
We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query interactions of attention. StructSA generates attention maps by recognizing space-time structures …
External link:
http://arxiv.org/abs/2404.03924
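The structural self-attention idea in the snippet above, attention derived from the structure of query-key correlations, can be pictured with a rough PyTorch sketch. This is not the paper's StructSA formulation; the module name, the 3x3 convolutional refinement of correlation maps, and all shapes are assumptions chosen only to make the general idea concrete.

```python
import torch
import torch.nn as nn

class CorrelationAwareAttention(nn.Module):
    """Toy attention that refines each query's correlation map with a small
    convolution before the softmax (illustrative only, not StructSA)."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        # Assumed refinement step: look at the spatial neighbourhood of each
        # query-key correlation value, standing in for "structure recognition".
        self.refine = nn.Conv2d(heads, heads, kernel_size=3, padding=1)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, h, w):
        # x: (batch, h*w, dim) tokens laid out on an h x w grid
        b, n, d = x.shape
        q, k, v = (t.reshape(b, n, self.heads, -1).transpose(1, 2)
                   for t in self.qkv(x).chunk(3, dim=-1))
        corr = (q @ k.transpose(-2, -1)) * self.scale            # (b, heads, n, n)
        # Treat each query's row of correlations as an h x w map and refine it.
        maps = corr.permute(0, 2, 1, 3).reshape(b * n, self.heads, h, w)
        corr = self.refine(maps).reshape(b, n, self.heads, n).permute(0, 2, 1, 3)
        attn = corr.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)

attn = CorrelationAwareAttention(dim=64, heads=4)
tokens = torch.randn(2, 8 * 8, 64)          # 8x8 grid of 64-d tokens
out = attn(tokens, h=8, w=8)                # (2, 64, 64)
```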
Referring image segmentation (RIS) aims to find a segmentation mask given a referring expression grounded to a region of the input image. Collecting labelled datasets for this task, however, is notoriously costly and labor-intensive. To overcome this …
External link:
http://arxiv.org/abs/2303.17811
Audiovisual automatic speech recognition (AV-ASR) aims to improve the robustness of a speech recognition system by incorporating visual information. Training fully supervised multimodal models for this task from scratch, however, is limited by the need …
External link:
http://arxiv.org/abs/2303.16501
Vision-language (VL) pre-training has recently gained much attention for its transferability and flexibility in novel concepts (e.g., cross-modality transfer) across various visual tasks. However, VL-driven segmentation has been under-explored, and …
External link:
http://arxiv.org/abs/2303.14396
Author:
Cho, Seokju, Shin, Heeseong, Hong, Sunghwan, Arnab, Anurag, Seo, Paul Hongsuck, Kim, Seungryong
Open-vocabulary semantic segmentation presents the challenge of labeling each pixel within an image based on a wide range of text descriptions. In this work, we introduce a novel cost-based approach to adapt vision-language foundation models, notably …
External link:
http://arxiv.org/abs/2303.11797
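The "cost-based" adaptation mentioned above rests on the idea of a cost volume: a dense grid of similarities between per-pixel image embeddings and text embeddings of candidate class names. The sketch below shows only that correlation step; the function name, feature dimensions, and the use of random tensors in place of real CLIP features are assumptions, not the paper's pipeline.

```python
import torch
import torch.nn.functional as F

def cosine_cost_volume(pixel_feats, text_feats):
    """pixel_feats: (B, C, H, W) dense image embeddings
       text_feats:  (K, C) one embedding per candidate class name
       returns:     (B, K, H, W) cosine-similarity cost volume"""
    pixel = F.normalize(pixel_feats, dim=1)
    text = F.normalize(text_feats, dim=1)
    return torch.einsum("bchw,kc->bkhw", pixel, text)

# Toy usage: 2 candidate classes over a 16x16 feature map.
pixel_feats = torch.randn(1, 512, 16, 16)   # would come from a vision-language image encoder
text_feats = torch.randn(2, 512)            # would come from the matching text encoder
cost = cosine_cost_volume(pixel_feats, text_feats)
pred = cost.argmax(dim=1)                   # coarse per-pixel class assignment, (1, 16, 16)
```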
Author:
Yang, Antoine, Nagrani, Arsha, Seo, Paul Hongsuck, Miech, Antoine, Pont-Tuset, Jordi, Laptev, Ivan, Sivic, Josef, Schmid, Cordelia
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos which are readily available at scale. The Vid2Seq architecture augments a language model with special time tokens, allowing it to …
External link:
http://arxiv.org/abs/2302.14115
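The "special time tokens" mentioned in the Vid2Seq snippet amount to enlarging a language model's vocabulary with discrete timestamp symbols so that event times and captions can be predicted in a single output sequence. The sketch below shows the vocabulary-extension step with Hugging Face transformers; the t5-small checkpoint, the <time_i> token format, and the choice of 100 time bins are placeholders, not the paper's configuration.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")        # placeholder checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

num_time_bins = 100                                          # assumed quantization of video duration
time_tokens = [f"<time_{i}>" for i in range(num_time_bins)]
tokenizer.add_tokens(time_tokens)                            # grow the vocabulary
model.resize_token_embeddings(len(tokenizer))                # grow the embedding matrix to match

# A dense-captioning target can now interleave times and text in one sequence, e.g.:
target = "<time_12> <time_25> a person opens the fridge <time_40> <time_63> pours a glass of milk"
ids = tokenizer(target, return_tensors="pt").input_ids
```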
In this report, we describe our submission to the Ego4D AudioVisual (AV) Speech Transcription Challenge 2022. Our pipeline is based on AVATAR, a state-of-the-art encoder-decoder model for AV-ASR that performs early fusion of spectrograms and RGB images …
External link:
http://arxiv.org/abs/2211.09966
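The "early fusion of spectrograms and RGB" described in the snippet can be pictured as concatenating audio and visual tokens into one sequence before a shared encoder, rather than merging the outputs of separate encoders. The sketch below shows only that idea in generic PyTorch; the projection layers, dimensions, and layer counts are assumptions, not the AVATAR architecture.

```python
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    def __init__(self, audio_dim=80, visual_dim=768, d_model=512):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)      # spectrogram frames -> tokens
        self.visual_proj = nn.Linear(visual_dim, d_model)    # RGB frame features -> tokens
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, spectrogram, frame_feats):
        # spectrogram: (B, T_audio, audio_dim), frame_feats: (B, T_video, visual_dim)
        tokens = torch.cat([self.audio_proj(spectrogram),
                            self.visual_proj(frame_feats)], dim=1)
        return self.encoder(tokens)   # joint audio-visual representation for a decoder

fused = EarlyFusionEncoder()(torch.randn(2, 120, 80), torch.randn(2, 16, 768))
```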