Showing 1 - 10 of 1,726 results for search: '"Shao, Rui"'
Building a general-purpose agent is a long-standing vision in the field of artificial intelligence. Existing agents have made remarkable progress in many domains, yet they still struggle to complete long-horizon tasks in an open world. We attribute t
External link:
http://arxiv.org/abs/2408.03615
Cropping high-resolution document images into multiple sub-images is the most widely used approach for current Multimodal Large Language Models (MLLMs) to do document understanding. Most current document understanding methods preserve all tokens w
External link:
http://arxiv.org/abs/2407.14439
Multimodal large language models (MLLMs) have demonstrated impressive capabilities across various vision-language tasks. However, a generalist MLLM typically underperforms compared with a specialist MLLM on most VL tasks, which can be attributed to t
External link:
http://arxiv.org/abs/2407.12709
RoboMP$^2$: A Robotic Multimodal Perception-Planning Framework with Multimodal Large Language Models
Published in:
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:33558-33574, 2024
Multimodal Large Language Models (MLLMs) have shown impressive reasoning abilities and general intelligence in various domains. It inspires researchers to train end-to-end MLLMs or utilize large models to generate policies with human-selected prompts
External link:
http://arxiv.org/abs/2404.04929
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios
This paper focuses on the challenge of answering questions in scenarios that are composed of rich and complex dynamic audio-visual components. Although existing Multimodal Large Language Models (MLLMs) can respond to audio-visual content, these respo
External link:
http://arxiv.org/abs/2403.04640
Large Language Models (LLMs) have shown remarkable performance in various emotion recognition tasks, thereby piquing the research community's curiosity for exploring their potential in emotional intelligence. However, several issues in the field of e
External link:
http://arxiv.org/abs/2401.06836
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals. However, most of the existing MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to i
External link:
http://arxiv.org/abs/2311.11860
Since photorealistic faces can be readily generated by facial manipulation technologies nowadays, potential malicious abuse of these technologies has drawn great concerns. Numerous deepfake detection methods are thus proposed. However, existing metho
External link:
http://arxiv.org/abs/2309.14991
Misinformation has become a pressing issue. Fake media, in both visual and textual forms, is widespread on the web. While various deepfake detection and text fake news detection methods have been proposed, they are only designed for single-modality f
External link:
http://arxiv.org/abs/2309.14203
Existing deepfake detection methods fail to generalize well to unseen or degraded samples, which can be attributed to the over-fitting of low-level forgery patterns. Here we argue that high-level semantics are also indispensable recipes for generaliz
External link:
http://arxiv.org/abs/2306.00863