Showing 1 - 10 of 21 for the search: '"Ayyubi, Hammad"'
Author:
Wang, Zhecan, Liu, Junzhang, Tang, Chia-Wei, Alomari, Hani, Sivakumar, Anushka, Sun, Rui, Li, Wenhao, Atabuzzaman, Md., Ayyubi, Hammad, You, Haoxuan, Ishmam, Alvi, Chang, Kai-Wei, Chang, Shih-Fu, Thomas, Chris
Existing vision-language understanding benchmarks largely consist of images of objects in their usual contexts. As a consequence, recent multimodal large language models can perform well with only a shallow visual understanding by relying on background…
External link:
http://arxiv.org/abs/2409.12953
Author:
Liu, Junzhang, Wang, Zhecan, Ayyubi, Hammad, You, Haoxuan, Thomas, Chris, Sun, Rui, Chang, Shih-Fu, Chang, Kai-Wei
Despite the widespread adoption of Vision-Language Understanding (VLU) benchmarks such as VQA v2, OKVQA, A-OKVQA, GQA, VCR, SWAG, and VisualCOMET, our analysis reveals a pervasive issue affecting their integrity: these benchmarks contain samples where…
External link:
http://arxiv.org/abs/2405.11145
Procedure Planning in instructional videos entails generating a sequence of action steps based on visual observations of the initial and target states. Despite the rapid progress in this task, there remain several critical challenges to be solved: (1)…
External link:
http://arxiv.org/abs/2403.18600
Author:
Ayyubi, Hammad A., Liu, Tianqi, Nagrani, Arsha, Lin, Xudong, Zhang, Mingda, Arnab, Anurag, Han, Feng, Zhu, Yukun, Liu, Jialu, Chang, Shih-Fu
Existing popular video captioning benchmarks and models deal with generic captions devoid of specific person, place or organization named entities. In contrast, news videos present a challenging setting where the caption requires such named entities…
External link:
http://arxiv.org/abs/2312.02188
Image-caption pretraining has been quite successfully used for downstream vision tasks like zero-shot image classification and object detection. However, image-caption pretraining is still a hard problem -- it requires multiple concepts (nouns) from…
External link:
http://arxiv.org/abs/2305.17540
Author:
You, Haoxuan, Sun, Rui, Wang, Zhecan, Chen, Long, Wang, Gengyu, Ayyubi, Hammad A., Chang, Kai-Wei, Chang, Shih-Fu
The field of vision-and-language (VL) understanding has made unprecedented progress with end-to-end large pre-trained VL models (VLMs). However, they still fall short in zero-shot reasoning tasks that require multi-step inferencing. To achieve this goal…
External link:
http://arxiv.org/abs/2305.14985
Author:
Chen, Long, Niu, Yulei, Chen, Brian, Lin, Xudong, Han, Guangxing, Thomas, Christopher, Ayyubi, Hammad, Ji, Heng, Chang, Shih-Fu
Given a long untrimmed video and natural language queries, video grounding (VG) aims to temporally localize the semantically-aligned video segments. Almost all existing VG work holds two simple but unrealistic assumptions: 1) All query sentences can…
External link:
http://arxiv.org/abs/2210.12444
Author:
Ayyubi, Hammad A., Thomas, Christopher, Chum, Lovish, Lokesh, Rahul, Chen, Long, Niu, Yulei, Lin, Xudong, Feng, Xuande, Koo, Jaywon, Ray, Sounak, Chang, Shih-Fu
Events describe happenings in our world that are of importance. Naturally, understanding events mentioned in multimedia content and how they are related forms an important way of comprehending our world. Existing literature can infer if events across…
External link:
http://arxiv.org/abs/2206.07207
Despite recent advances in Visual Question Answering (VQA), it remains a challenge to determine how much success can be attributed to sound reasoning and comprehension ability. We seek to investigate this question by proposing a new task of rationale generation…
External link:
http://arxiv.org/abs/2004.02032
Published in:
ICLR Workshop on Neural Networks and Differential Equations, 2020
Neural Ordinary Differential Equations (NODEs) have proven to be a powerful modeling tool for approximating (interpolation) and forecasting (extrapolation) irregularly sampled time series data. However, their performance degrades substantially when a…
External link:
http://arxiv.org/abs/2003.03695