Showing 1 - 10 of 168 results for search: '"You, Haoxuan"'
Author:
Ye, Hanrong, Zhang, Haotian, Daxberger, Erik, Chen, Lin, Lin, Zongyu, Li, Yanghao, Zhang, Bowen, You, Haoxuan, Xu, Dan, Gan, Zhe, Lu, Jiasen, Yang, Yinfei
This research aims to comprehensively explore building a multimodal foundation model for egocentric video understanding. To achieve this goal, we work on three fronts. First, as there is a lack of QA data for egocentric video understanding, we develop…
External link:
http://arxiv.org/abs/2410.07177
Author:
Zhang, Haotian, Gao, Mingfei, Gan, Zhe, Dufter, Philipp, Wenzel, Nina, Huang, Forrest, Shah, Dhruti, Du, Xianzhi, Zhang, Bowen, Li, Yanghao, Dodge, Sam, You, Keen, Yang, Zhen, Timofeev, Aleksei, Xu, Mingze, Chen, Hong-You, Fauconnier, Jean-Philippe, Lai, Zhengfeng, You, Haoxuan, Wang, Zirui, Dehghan, Afshin, Grasch, Peter, Yang, Yinfei
We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. Building upon the MM1 architecture, MM1.5 adopts…
External link:
http://arxiv.org/abs/2409.20566
Author:
Wang, Zhecan, Liu, Junzhang, Tang, Chia-Wei, Alomari, Hani, Sivakumar, Anushka, Sun, Rui, Li, Wenhao, Atabuzzaman, Md., Ayyubi, Hammad, You, Haoxuan, Ishmam, Alvi, Chang, Kai-Wei, Chang, Shih-Fu, Thomas, Chris
Existing vision-language understanding benchmarks largely consist of images of objects in their usual contexts. As a consequence, recent multimodal large language models can perform well with only a shallow visual understanding by relying on background…
External link:
http://arxiv.org/abs/2409.12953
Author:
Liu, Junzhang, Wang, Zhecan, Ayyubi, Hammad, You, Haoxuan, Thomas, Chris, Sun, Rui, Chang, Shih-Fu, Chang, Kai-Wei
Despite the widespread adoption of Vision-Language Understanding (VLU) benchmarks such as VQA v2, OKVQA, A-OKVQA, GQA, VCR, SWAG, and VisualCOMET, our analysis reveals a pervasive issue affecting their integrity: these benchmarks contain samples where…
External link:
http://arxiv.org/abs/2405.11145
Author:
Zhang, Haotian, You, Haoxuan, Dufter, Philipp, Zhang, Bowen, Chen, Chen, Chen, Hong-You, Fu, Tsu-Jui, Wang, William Yang, Chang, Shih-Fu, Gan, Zhe, Yang, Yinfei
While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it poses certain limitations: it is constrained by the pre-trained fixed visual encoder and fails to perform…
External link:
http://arxiv.org/abs/2404.07973
Author:
Nie, Jingping, Shao, Hanya, Fan, Yuang, Shao, Qijia, You, Haoxuan, Preindl, Matthias, Jiang, Xiaofan
Despite the global mental health crisis, barriers to screenings, professionals, and treatments remain high. In collaboration with licensed psychotherapists, we propose a Conversational AI Therapist with psychotherapeutic Interventions (CaiTI), a platform…
External link:
http://arxiv.org/abs/2403.10779
Author:
Wang, Zhecan, Chen, Long, You, Haoxuan, Xu, Keyang, He, Yicheng, Li, Wenhao, Codella, Noel, Chang, Kai-Wei, Chang, Shih-Fu
Published in:
EMNLP 2023
Vision-language (VL) understanding tasks evaluate models' comprehension of complex visual scenes through multiple-choice questions. However, we have identified two dataset biases that models can exploit as shortcuts to resolve various VL tasks correctly…
External link:
http://arxiv.org/abs/2310.14670
Author:
You, Haoxuan, Zhang, Haotian, Gan, Zhe, Du, Xianzhi, Zhang, Bowen, Wang, Zirui, Cao, Liangliang, Chang, Shih-Fu, Yang, Yinfei
We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions. To unify referring and grounding in the LLM…
External link:
http://arxiv.org/abs/2310.07704
Vision-language tasks, such as VQA, SNLI-VE, and VCR, are challenging because they require the model's reasoning ability to understand the semantics of the visual world and natural language. Supervised methods working for vision-language tasks have been…
External link:
http://arxiv.org/abs/2307.00862
Author:
You, Haoxuan, Sun, Rui, Wang, Zhecan, Chen, Long, Wang, Gengyu, Ayyubi, Hammad A., Chang, Kai-Wei, Chang, Shih-Fu
The field of vision-and-language (VL) understanding has made unprecedented progress with end-to-end large pre-trained VL models (VLMs). However, they still fall short in zero-shot reasoning tasks that require multi-step inferencing. To achieve this goal…
External link:
http://arxiv.org/abs/2305.14985