Showing 1 - 10
of 210
for search: '"Huang, Shiyuan"'
Language has been useful in extending the vision encoder to data from diverse distributions without empirical discovery in training domains. However, as the image description is mostly at a coarse-grained level and ignores visual details, the resulted…
External link:
http://arxiv.org/abs/2405.18405
Author:
Huang, Shiyuan
Recent advances in deep learning models have shown impressive capabilities in various computer vision tasks, which encourages the integration of these models into real-world vision systems such as smart devices. This integration presents new challeng…
Large language models (LLMs) such as ChatGPT have demonstrated superior performance on a variety of natural language processing (NLP) tasks including sentiment analysis, mathematical reasoning and summarization. Furthermore, since these models are in…
External link:
http://arxiv.org/abs/2310.11207
Vision Transformers (ViTs) emerge to achieve impressive performance on many data-abundant computer vision tasks by capturing long-range dependencies among local features. However, under few-shot learning (FSL) settings on small datasets with only a f…
External link:
http://arxiv.org/abs/2303.15466
Generalized few-shot object detection aims to achieve precise detection on both base classes with abundant annotations and novel classes with limited training data. Existing approaches enhance few-shot generalization with the sacrifice of base-class…
External link:
http://arxiv.org/abs/2303.09674
Author:
Yang, Yuncong, Ma, Jiawei, Huang, Shiyuan, Chen, Long, Lin, Xudong, Han, Guangxing, Chang, Shih-Fu
Video representation learning has been successful in video-text pre-training for zero-shot transfer, where each sentence is trained to be close to the paired video clips in a common feature space. For long videos, given a paragraph of description whe…
External link:
http://arxiv.org/abs/2212.13738
In Video Question Answering (VideoQA), answering general questions about a video requires its visual information. Yet, video often contains redundant information irrelevant to the VideoQA task. For example, if the task is only to answer questions sim…
External link:
http://arxiv.org/abs/2210.08391
Author:
Lin, Xudong, Tiwari, Simran, Huang, Shiyuan, Li, Manling, Shou, Mike Zheng, Ji, Heng, Chang, Shih-Fu
Multi-channel video-language retrieval requires models to understand information from different channels (e.g. video+question, video+speech) to correctly link a video with a textual response or query. Fortunately, contrastive multimodal models are…
External link:
http://arxiv.org/abs/2206.02082
We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection, which are complementary to each other by definition. Most of the previous works on multi-modal FSOD…
External link:
http://arxiv.org/abs/2204.07841
Few-shot object detection (FSOD), with the aim to detect novel objects using very few training examples, has recently attracted great research interest in the community. Metric-learning based methods have been demonstrated to be effective for this ta…
External link:
http://arxiv.org/abs/2203.15021