Showing 1 - 10 of 518 for search: '"Lin Xudong"'
Author:
Su, Hung-Ting, Hsu, Ya-Ching, Lin, Xudong, Shi, Xiang-Qian, Niu, Yulei, Hsu, Han-Yuan, Lee, Hung-yi, Hsu, Winston H.
Large language models (LLMs) equipped with chain-of-thought (CoT) prompting have shown significant multi-step reasoning capabilities on factual content such as mathematics, commonsense, and logic. However, their performance in narrative reasoning, which…
External link:
http://arxiv.org/abs/2409.14324
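As context for the chain-of-thought prompting this abstract builds on, here is a minimal sketch of CoT prompting, assuming an OpenAI-style chat API; the model name and the word problem are illustrative placeholders, not taken from the paper.

# Minimal chain-of-thought prompting sketch (assumed OpenAI-style API).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = (
    "A library has 3 shelves with 24 books each. "
    "If 17 books are checked out, how many remain?"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        # The trigger phrase asks the model to reason in steps
        # before committing to a final answer.
        {"role": "user", "content": question + " Let's think step by step."},
    ],
)
print(response.choices[0].message.content)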
Author:
Lee, Jinhyuk, Chen, Anthony, Dai, Zhuyun, Dua, Dheeru, Sachan, Devendra Singh, Boratko, Michael, Luan, Yi, Arnold, Sébastien M. R., Perot, Vincent, Dalmia, Siddharth, Hu, Hexiang, Lin, Xudong, Pasupat, Panupong, Amini, Aida, Cole, Jeremy R., Riedel, Sebastian, Naim, Iftekhar, Chang, Ming-Wei, Guu, Kelvin
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. Leveraging LCLMs' ability to natively ingest and process entire corpora of information…
External link:
http://arxiv.org/abs/2406.13121
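The retrieval-free setup described above can be sketched as corpus-in-context prompting: the whole (small) corpus is placed directly in the prompt instead of retrieving top-k passages. The corpus contents and prompt wording below are invented for illustration.

# Corpus-in-context prompting sketch: no retriever; the corpus itself
# becomes part of the prompt handed to a long-context model.
corpus = {
    "doc1": "The Eiffel Tower was completed in 1889.",
    "doc2": "The Louvre is the world's most-visited museum.",
}

def build_prompt(query):
    # Number each document so the model can cite its evidence.
    docs = "\n".join(f"[{i}] ({k}) {v}" for i, (k, v) in enumerate(corpus.items()))
    return (
        "You are given a corpus of documents.\n"
        f"{docs}\n\n"
        f"Question: {query}\n"
        "Answer using only the corpus, citing document numbers."
    )

print(build_prompt("When was the Eiffel Tower completed?"))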
Author:
Su, Hung-Ting, Chao, Chun-Tong, Hsu, Ya-Ching, Lin, Xudong, Niu, Yulei, Lee, Hung-Yi, Hsu, Winston H.
Large Language Models (LLMs) have demonstrated effectiveness not only in language tasks but also in video reasoning. This paper introduces a novel dataset, Tropes in Movies (TiM), designed as a testbed for exploring two critical yet previously overlooked…
External link:
http://arxiv.org/abs/2406.10923
Author:
Fu, Xingyu, Hu, Yushi, Li, Bangzheng, Feng, Yu, Wang, Haoyu, Lin, Xudong, Roth, Dan, Smith, Noah A., Ma, Wei-Chiu, Krishna, Ranjay
We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative depth estimation)…
External link:
http://arxiv.org/abs/2404.12390
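Blink-style tasks are multiple-choice visual perception questions; a hedged sketch of posing one to a multimodal model through an OpenAI-style vision chat API follows. The image URL, answer options, and model name are illustrative assumptions, not Blink data.

# Multiple-choice visual question sketch (assumed OpenAI-style vision API).
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Which marked point is closer to the camera? "
                     "(A) left (B) right. Answer with a single letter."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/depth_pair.jpg"}},  # placeholder image
        ],
    }],
)
print(response.choices[0].message.content)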
We study the problem of procedure planning in instructional videos, which aims to make a goal-oriented sequence of action steps given partial visual state observations. The motivation of this problem is to learn a structured and plannable state and action space…
External link:
http://arxiv.org/abs/2403.01599
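The planning interface described above (start state, goal state, sequence of action steps) can be illustrated with a toy search over a hand-written state graph; the states and transitions below are invented, whereas real systems predict steps from learned visual features.

# Toy procedure-planning sketch: breadth-first search over a discrete
# state graph, returning the action steps that reach the goal.
from collections import deque

# action -> (precondition state, resulting state); invented example task
TRANSITIONS = {
    "crack eggs": ("raw ingredients", "eggs in bowl"),
    "whisk": ("eggs in bowl", "batter"),
    "fry": ("batter", "omelette"),
}

def plan(start, goal):
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if state == goal:
            return steps
        for action, (pre, post) in TRANSITIONS.items():
            if pre == state and post not in seen:
                seen.add(post)
                queue.append((post, steps + [action]))
    return []

print(plan("raw ingredients", "omelette"))
# -> ['crack eggs', 'whisk', 'fry']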
Author:
Ayyubi, Hammad A., Liu, Tianqi, Nagrani, Arsha, Lin, Xudong, Zhang, Mingda, Arnab, Anurag, Han, Feng, Zhu, Yukun, Liu, Jialu, Chang, Shih-Fu
Existing popular video captioning benchmarks and models deal with generic captions devoid of specific person, place, or organization named entities. In contrast, news videos present a challenging setting where the caption requires such named entities…
External link:
http://arxiv.org/abs/2312.02188
Author:
Han, Xiaotian, You, Quanzeng, Liu, Yongfei, Chen, Wentao, Zheng, Huangjie, Mrini, Khalil, Lin, Xudong, Wang, Yiqi, Zhai, Bohan, Yuan, Jianbo, Wang, Heng, Yang, Hongxia
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence. These models not only excel in traditional vision-language tasks but also demonstrate impressive performance in contemporary multi-modal benchmarks…
External link:
http://arxiv.org/abs/2311.11567
Online resources such as WikiHow compile a wide range of scripts for performing everyday tasks, which can assist models in learning to reason about procedures. However, the scripts are always presented in a linear manner, which does not reflect the flexibility…
External link:
http://arxiv.org/abs/2305.17542
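The linear-versus-flexible point above can be made concrete by modeling a script as a partial order rather than a list: steps with no dependency between them may be done in either order, so several linear scripts are consistent with one task. The task and steps below are invented; Python's standard graphlib does the linearization.

# Script-as-partial-order sketch: step -> set of steps that must happen first.
from graphlib import TopologicalSorter

script = {
    "boil water": set(),
    "grind beans": set(),
    "brew": {"boil water", "grind beans"},
    "pour": {"brew"},
}

# One valid linearization; "boil water" and "grind beans" are unordered
# with respect to each other, so other orders are equally valid.
print(list(TopologicalSorter(script).static_order()))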
Causal Video Question Answering (CVidQA) queries not only association or temporal relations but also causal relations in a video. Existing question synthesis methods pre-trained question generation (QG) systems on reading comprehension datasets…
External link:
http://arxiv.org/abs/2304.03754
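As a toy stand-in for the question synthesis the abstract describes, the sketch below rewrites a cause-effect caption into a "why" question with the cause as the answer. The caption and the single template rule are invented; the paper's learned pipeline is far more general.

# Template-based causal QA synthesis sketch for captions of the form
# "<effect> because <cause>".
def synthesize_causal_qa(caption):
    if " because " not in caption:
        return None
    effect, cause = caption.split(" because ", 1)
    question = f"Why is it that {effect.strip().rstrip('.')}?"
    return question, cause.strip().rstrip(".")

print(synthesize_causal_qa("the man slips because the floor is wet"))
# -> ('Why is it that the man slips?', 'the floor is wet')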
Vision Transformers (ViTs) have emerged to achieve impressive performance on many data-abundant computer vision tasks by capturing long-range dependencies among local features. However, under few-shot learning (FSL) settings on small datasets with only a few…
External link:
http://arxiv.org/abs/2303.15466
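One common recipe in the few-shot setting this abstract targets is nearest-prototype classification over frozen ViT features; the sketch below assumes the timm library, and the model choice plus the random tensors standing in for real support/query images are placeholders.

# Nearest-prototype few-shot classification sketch on frozen ViT features.
import timm
import torch

model = timm.create_model("vit_small_patch16_224", pretrained=True, num_classes=0)
model.eval()

# 5-way 1-shot: one support image per class, plus one query image.
support = torch.randn(5, 3, 224, 224)
query = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    proto = model(support)  # (5, D): one prototype per class
    q = model(query)        # (1, D)

# Classify the query by cosine similarity to each class prototype.
sims = torch.nn.functional.cosine_similarity(q, proto)
print("predicted class:", sims.argmax().item())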