Showing 1 - 10 of 168 for search: '"Shou, Mike Zheng"'
Author:
Wu, Qinchen, Gao, Difei, Lin, Kevin Qinghong, Wu, Zhuoyu, Guo, Xiangwu, Li, Peiran, Zhang, Weichen, Wang, Hengxu, Shou, Mike Zheng
The advent of Multimodal LLMs has significantly enhanced image OCR recognition capabilities, making GUI automation a viable reality for increasing efficiency in digital tasks. One fundamental aspect of developing a GUI automation system is understand
External link:
http://arxiv.org/abs/2406.13719
Author:
Chen, Joya, Lv, Zhaoyang, Wu, Shiwei, Lin, Kevin Qinghong, Song, Chenan, Gao, Difei, Liu, Jia-Wei, Gao, Ziteng, Mao, Dongxing, Shou, Mike Zheng
Recent Large Language Models have been enhanced with vision capabilities, enabling them to comprehend images, videos, and interleaved vision-language content. However, the learning methods of these large multimodal models typically treat videos as pr
External link:
http://arxiv.org/abs/2406.11816
Author:
Lin, Kevin Qinghong, Li, Linjie, Gao, Difei, Wu, Qinchen, Yan, Mingyi, Yang, Zhengyuan, Wang, Lijuan, Shou, Mike Zheng
Graphical User Interface (GUI) automation holds significant promise for enhancing human productivity by assisting with computer tasks. Existing task formulations primarily focus on simple tasks that can be specified by a single, language-only instruc
External link:
http://arxiv.org/abs/2406.10227
Digital watermarking techniques are crucial for copyright protection and source identification of images, especially in the era of generative AI models. However, many existing watermarking methods, particularly content-agnostic approaches that embed
External link:
http://arxiv.org/abs/2406.09026
Watermarking is crucial for protecting the copyright of AI-generated images. We propose WMAdapter, a diffusion model watermark plugin that takes user-specified watermark information and allows for seamless watermark imprinting during the diffusion ge
External link:
http://arxiv.org/abs/2406.08337
Author:
Song, Yiren, Huang, Shijie, Yao, Chen, Ye, Xiaojun, Ci, Hai, Liu, Jiaming, Zhang, Yuxuan, Shou, Mike Zheng
The painting process of artists is inherently stepwise and varies significantly among different painters and styles. Generating detailed, step-by-step painting processes is essential for art education and research, yet remains largely underexplored.
External link:
http://arxiv.org/abs/2406.06062
Training models with longer in-context lengths is a significant challenge for multimodal models due to substantial GPU memory and computational costs. This exploratory study does not present state-of-the-art models; rather, it introduces an innovative
External link:
http://arxiv.org/abs/2406.02547
Author:
Ma, Feipeng, Xue, Hongwei, Wang, Guangting, Zhou, Yizhou, Rao, Fengyun, Yan, Shilin, Zhang, Yueyi, Wu, Siying, Shou, Mike Zheng, Sun, Xiaoyan
Existing Multimodal Large Language Models (MLLMs) follow the paradigm that perceives visual information by aligning visual features with the input space of Large Language Models (LLMs), and concatenating visual tokens with text tokens to form a unifi
External link:
http://arxiv.org/abs/2405.20339
Author:
Ma, Feipeng, Xue, Hongwei, Wang, Guangting, Zhou, Yizhou, Rao, Fengyun, Yan, Shilin, Zhang, Yueyi, Wu, Siying, Shou, Mike Zheng, Sun, Xiaoyan
Most multi-modal tasks can be formulated into problems of either generation or embedding. Existing models usually tackle these two types of problems by decoupling language modules into a text decoder for generation, and a text encoder for embedding.
External link:
http://arxiv.org/abs/2405.19333
Question answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge. By enhancing these capabilities, humans can more effectively utilize data, leading to better comprehension and learn
External link:
http://arxiv.org/abs/2405.14974