Search results - "Shou, Mike Zheng"

Report

GUI Action Narrator: Where and When Did That Action Take Place?

Authors: Wu, Qinchen, Gao, Difei, Lin, Kevin Qinghong, Wu, Zhuoyu, Guo, Xiangwu, Li, Peiran, Zhang, Weichen, Wang, Hengxu, Shou, Mike Zheng

The advent of Multimodal LLMs has significantly enhanced image OCR recognition capabilities, making GUI automation a viable reality for increasing efficiency in digital tasks. One fundamental aspect of developing a GUI automation system is understand

External link: http://arxiv.org/abs/2406.13719

Report

VideoLLM-online: Online Video Large Language Model for Streaming Video

Authors: Chen, Joya, Lv, Zhaoyang, Wu, Shiwei, Lin, Kevin Qinghong, Song, Chenan, Gao, Difei, Liu, Jia-Wei, Gao, Ziteng, Mao, Dongxing, Shou, Mike Zheng

Recent Large Language Models have been enhanced with vision capabilities, enabling them to comprehend images, videos, and interleaved vision-language content. However, the learning methods of these large multimodal models typically treat videos as pr

External link: http://arxiv.org/abs/2406.11816

Report

VideoGUI: A Benchmark for GUI Automation from Instructional Videos

Authors: Lin, Kevin Qinghong, Li, Linjie, Gao, Difei, Wu, Qinchen, Yan, Mingyi, Yang, Zhengyuan, Wang, Lijuan, Shou, Mike Zheng

Graphical User Interface (GUI) automation holds significant promise for enhancing human productivity by assisting with computer tasks. Existing task formulations primarily focus on simple tasks that can be specified by a single, language-only instruc

External link: http://arxiv.org/abs/2406.10227

Report

Steganalysis on Digital Watermarking: Is Your Defense Truly Impervious?

Authors: Yang, Pei, Ci, Hai, Song, Yiren, Shou, Mike Zheng

Digital watermarking techniques are crucial for copyright protection and source identification of images, especially in the era of generative AI models. However, many existing watermarking methods, particularly content-agnostic approaches that embed

External link: http://arxiv.org/abs/2406.09026

Report

WMAdapter: Adding WaterMark Control to Latent Diffusion Models

Authors: Ci, Hai, Song, Yiren, Yang, Pei, Xie, Jinheng, Shou, Mike Zheng

Watermarking is crucial for protecting the copyright of AI-generated images. We propose WMAdapter, a diffusion model watermark plugin that takes user-specified watermark information and allows for seamless watermark imprinting during the diffusion ge

External link: http://arxiv.org/abs/2406.08337

Report

ProcessPainter: Learn Painting Process from Sequence Data

Authors: Song, Yiren, Huang, Shijie, Yao, Chen, Ye, Xiaojun, Ci, Hai, Liu, Jiaming, Zhang, Yuxuan, Shou, Mike Zheng

The painting process of artists is inherently stepwise and varies significantly among different painters and styles. Generating detailed, step-by-step painting processes is essential for art education and research, yet remains largely underexplored.

External link: http://arxiv.org/abs/2406.06062

Report

Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning

Authors: Wang, Alex Jinpeng, Li, Linjie, Lin, Yiqi, Li, Min, Wang, Lijuan, Shou, Mike Zheng

Training models with longer in-context lengths is a significant challenge for multimodal models due to substantial GPU memory and computational costs. This exploratory study does not present state-of-the-art models; rather, it introduces an innovative

External link: http://arxiv.org/abs/2406.02547

Report

Visual Perception by Large Language Model's Weights

Authors: Ma, Feipeng, Xue, Hongwei, Wang, Guangting, Zhou, Yizhou, Rao, Fengyun, Yan, Shilin, Zhang, Yueyi, Wu, Siying, Shou, Mike Zheng, Sun, Xiaoyan

Existing Multimodal Large Language Models (MLLMs) follow the paradigm that perceives visual information by aligning visual features with the input space of Large Language Models (LLMs), and concatenating visual tokens with text tokens to form a unifi

External link: http://arxiv.org/abs/2405.20339

Report

Multi-Modal Generative Embedding Model

Authors: Ma, Feipeng, Xue, Hongwei, Wang, Guangting, Zhou, Yizhou, Rao, Fengyun, Yan, Shilin, Zhang, Yueyi, Wu, Siying, Shou, Mike Zheng, Sun, Xiaoyan

Most multi-modal tasks can be formulated into problems of either generation or embedding. Existing models usually tackle these two types of problems by decoupling language modules into a text decoder for generation, and a text encoder for embedding.

External link: http://arxiv.org/abs/2405.19333

Report

LOVA3: Learning to Visual Question Answering, Asking and Assessment

Authors: Zhao, Henry Hengyuan, Zhou, Pan, Gao, Difei, Shou, Mike Zheng

Question answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge. By enhancing these capabilities, humans can more effectively utilize data, leading to better comprehension and learn

External link: http://arxiv.org/abs/2405.14974
