Search results

Report

Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks

Authors: Li, Zaijing; Xie, Yuquan; Shao, Rui; Chen, Gongwei; Jiang, Dongmei; Nie, Liqiang

Building a general-purpose agent is a long-standing vision in the field of artificial intelligence. Existing agents have made remarkable progress in many domains, yet they still struggle to complete long-horizon tasks in an open world. We attribute t

External link: http://arxiv.org/abs/2408.03615

Report

Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding

Authors: Zhang, Renshan; Lyu, Yibo; Shao, Rui; Chen, Gongwei; Guan, Weili; Nie, Liqiang

Cropping high-resolution document images into multiple sub-images is the most widely used approach for current Multimodal Large Language Models (MLLMs) to do document understanding. Most of current document understanding methods preserve all tokens w

External link: http://arxiv.org/abs/2407.14439

Report

MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models

Authors: Shen, Leyang; Chen, Gongwei; Shao, Rui; Guan, Weili; Nie, Liqiang

Multimodal large language models (MLLMs) have demonstrated impressive capabilities across various vision-language tasks. However, a generalist MLLM typically underperforms compared with a specialist MLLM on most VL tasks, which can be attributed to t

External link: http://arxiv.org/abs/2407.12709

Report

RoboMP$^2$: A Robotic Multimodal Perception-Planning Framework with Multimodal Large Language Models

Authors: Lv, Qi; Li, Hao; Deng, Xiang; Shao, Rui; Wang, Michael Yu; Nie, Liqiang

Published in: Proceedings of the 41st International Conference on Machine Learning, PMLR 235:33558-33574, 2024

Multimodal Large Language Models (MLLMs) have shown impressive reasoning abilities and general intelligence in various domains. It inspires researchers to train end-to-end MLLMs or utilize large models to generate policies with human-selected prompts

External link: http://arxiv.org/abs/2404.04929

Report

CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios

Authors: Ye, Qilang; Yu, Zitong; Shao, Rui; Xie, Xinyu; Torr, Philip; Cao, Xiaochun

This paper focuses on the challenge of answering questions in scenarios that are composed of rich and complex dynamic audio-visual components. Although existing Multimodal Large Language Models (MLLMs) can respond to audio-visual content, these respo

External link: http://arxiv.org/abs/2403.04640

Report

Enhancing Emotional Generation Capability of Large Language Models via Emotional Chain-of-Thought

Authors: Li, Zaijing; Chen, Gongwei; Shao, Rui; Xie, Yuquan; Jiang, Dongmei; Nie, Liqiang

Large Language Models (LLMs) have shown remarkable performance in various emotion recognition tasks, thereby piquing the research community's curiosity for exploring their potential in emotional intelligence. However, several issues in the field of e

External link: http://arxiv.org/abs/2401.06836

Report

LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

Authors: Chen, Gongwei; Shen, Leyang; Shao, Rui; Deng, Xiang; Nie, Liqiang

Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals. However, most of the existing MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to i

External link: http://arxiv.org/abs/2311.11860

Report

Robust Sequential DeepFake Detection

Authors: Shao, Rui; Wu, Tianxing; Liu, Ziwei

Since photorealistic faces can be readily generated by facial manipulation technologies nowadays, potential malicious abuse of these technologies has drawn great concerns. Numerous deepfake detection methods are thus proposed. However, existing metho

External link: http://arxiv.org/abs/2309.14991

Report

Detecting and Grounding Multi-Modal Media Manipulation and Beyond

Authors: Shao, Rui; Wu, Tianxing; Wu, Jianlong; Nie, Liqiang; Liu, Ziwei

Misinformation has become a pressing issue. Fake media, in both visual and textual forms, is widespread on the web. While various deepfake detection and text fake news detection methods have been proposed, they are only designed for single-modality f

External link: http://arxiv.org/abs/2309.14203

Report

DeepFake-Adapter: Dual-Level Adapter for DeepFake Detection

Authors: Shao, Rui; Wu, Tianxing; Nie, Liqiang; Liu, Ziwei

Existing deepfake detection methods fail to generalize well to unseen or degraded samples, which can be attributed to the over-fitting of low-level forgery patterns. Here we argue that high-level semantics are also indispensable recipes for generaliz

External link: http://arxiv.org/abs/2306.00863
