Showing 1 - 10 of 1,621 for query: '"multimodal large language models"'
Author:
Wu, Junda, Lyu, Hanjia, Xia, Yu, Zhang, Zhehao, Barrow, Joe, Kumar, Ishita, Mirtaheri, Mehrnoosh, Chen, Hongjie, Rossi, Ryan A., Dernoncourt, Franck, Yu, Tong, Zhang, Ruiyi, Gu, Jiuxiang, Ahmed, Nesreen K., Wang, Yu, Chen, Xiang, Deilamsalehy, Hanieh, Park, Namyong, Kim, Sungchul, Yang, Huanrui, Mitra, Subrata, Hu, Zhengmian, Lipka, Nedim, Nguyen, Dang, Zhao, Yue, Luo, Jiebo, McAuley, Julian
Multimodal Large Language Models (MLLMs) have become increasingly important due to their state-of-the-art performance and ability to integrate multiple data modalities, such as text, images, and audio, to perform complex tasks with high accuracy. This …
External link:
http://arxiv.org/abs/2412.02142
Author:
Yan, Ziang, Li, Zhilin, He, Yinan, Wang, Chenting, Li, Kunchang, Li, Xinhao, Zeng, Xiangyu, Wang, Zilei, Wang, Yali, Qiao, Yu, Wang, Limin, Wang, Yi
Current multimodal large language models (MLLMs) struggle with fine-grained or precise understanding of visuals, though they offer comprehensive perception and reasoning across a spectrum of vision applications. Recent studies either develop tool-using or …
External link:
http://arxiv.org/abs/2412.19326
While Multimodal Large Language Models (MLLMs) have made remarkable progress in vision-language reasoning, they are also more susceptible to producing harmful content compared to models that focus solely on text. Existing defensive prompting techniques …
External link:
http://arxiv.org/abs/2412.18826
Author:
Wu, Mengyang, Zhao, Yuzhi, Cao, Jialun, Xu, Mingjie, Jiang, Zhongming, Wang, Xuehui, Li, Qinbin, Hu, Guangneng, Qin, Shengchao, Fu, Chi-Wing
Controversial content largely inundates the Internet, infringing various cultural norms and child protection standards. Traditional Image Content Moderation (ICM) models fall short in producing precise moderation decisions for diverse standards, while …
External link:
http://arxiv.org/abs/2412.18216
Remote-sensing mineral exploration is critical for identifying economically viable mineral deposits, yet it poses significant challenges for multimodal large language models (MLLMs). These include limitations in domain-specific geological knowledge a…
External link:
http://arxiv.org/abs/2412.17339
Human preference alignment can greatly enhance Multimodal Large Language Models (MLLMs), but collecting high-quality preference data is costly. A promising solution is the self-evolution strategy, where models are iteratively trained on data they generate …
External link:
http://arxiv.org/abs/2412.15650
Multimodal large language models (MLLMs) combine visual and textual data for tasks such as image captioning and visual question answering. Proper uncertainty calibration is crucial, yet challenging, for reliable use in areas like healthcare and autonomous …
External link:
http://arxiv.org/abs/2412.14660
Humans possess the visual-spatial intelligence to remember spaces from sequential visual observations. However, can Multimodal Large Language Models (MLLMs) trained on million-scale video datasets also "think in space" from videos? We present a novel …
External link:
http://arxiv.org/abs/2412.14171
In this paper, we introduce Modality-Inconsistent Continual Learning (MICL), a new continual learning scenario for Multimodal Large Language Models (MLLMs) that involves tasks with inconsistent modalities (image, audio, or video) and varying task types …
External link:
http://arxiv.org/abs/2412.13050
Recent advancements in Multimodal Large Language Models (MLLMs) have generated significant interest in their ability to autonomously interact with and interpret Graphical User Interfaces (GUIs). A major challenge in these systems is grounding-accurat…
External link:
http://arxiv.org/abs/2412.10840