Search results


MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model

Authors: Yang, Zhen; Chen, Jinhao; Du, Zhengxiao; Yu, Wenmeng; Wang, Weihan; Hong, Wenyi; Jiang, Zhihuan; Xu, Bin; Dong, Yuxiao; Tang, Jie

Large language models (LLMs) have demonstrated significant capabilities in mathematical reasoning, particularly with text-based mathematical problems. However, current multi-modal large language models (MLLMs), especially those specialized in mathema

External link: http://arxiv.org/abs/2409.13729


CogVLM2: Visual Language Models for Image and Video Understanding

Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new genera

External link: http://arxiv.org/abs/2408.16500


VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

Large Multimodal Models (LMMs) have ushered in a new era in artificial intelligence, merging capabilities in both language and vision to form highly capable Visual Foundation Agents. These agents are postulated to excel across a myriad of tasks, pote

External link: http://arxiv.org/abs/2408.06327


CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Authors: Yang, Zhuoyi; Teng, Jiayan; Zheng, Wendi; Ding, Ming; Huang, Shiyu; Xu, Jiazheng; Yang, Yuanming; Hong, Wenyi; Zhang, Xiaohan; Feng, Guanyu; Yin, Da; Gu, Xiaotao; Zhang, Yuxuan; Wang, Weihan; Cheng, Yean; Liu, Ting; Xu, Bin; Dong, Yuxiao; Tang, Jie

We present CogVideoX, a large-scale text-to-video generation model based on diffusion transformer, which can generate 10-second continuous videos aligned with text prompt, with a frame rate of 16 fps and resolution of 768 * 1360 pixels. Previous vide

External link: http://arxiv.org/abs/2408.06072


LVBench: An Extreme Long Video Understanding Benchmark

Authors: Wang, Weihan; He, Zehai; Hong, Wenyi; Cheng, Yean; Zhang, Xiaohan; Qi, Ji; Gu, Xiaotao; Huang, Shiyu; Xu, Bin; Dong, Yuxiao; Ding, Ming; Tang, Jie

Recent progress in multimodal large language models has markedly enhanced the understanding of short videos (typically under one minute), and several evaluation datasets have emerged accordingly. However, these advancements fall short of meeting the

External link: http://arxiv.org/abs/2406.08035


Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer

Authors: Yang, Zhuoyi; Jiang, Heyang; Hong, Wenyi; Teng, Jiayan; Zheng, Wendi; Dong, Yuxiao; Ding, Ming; Tang, Jie

Diffusion models have shown remarkable performance in image generation in recent years. However, due to a quadratic increase in memory during generating ultra-high-resolution images (e.g. 4096*4096), the resolution of generated images is often limite

External link: http://arxiv.org/abs/2405.04312


CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations

Authors: Qi, Ji; Ding, Ming; Wang, Weihan; Bai, Yushi; Lv, Qingsong; Hong, Wenyi; Xu, Bin; Hou, Lei; Li, Juanzi; Dong, Yuxiao; Tang, Jie

Vision-Language Models (VLMs) have demonstrated their broad effectiveness thanks to extensive training in aligning visual instructions to responses. However, such training of conclusive alignment leads models to ignore essential visual reasoning, fur

External link: http://arxiv.org/abs/2402.04236


CogAgent: A Visual Language Model for GUI Agents

Authors: Hong, Wenyi; Wang, Weihan; Lv, Qingsong; Xu, Jiazheng; Yu, Wenmeng; Ji, Junhui; Wang, Yan; Wang, Zihan; Zhang, Yuxuan; Li, Juanzi; Xu, Bin; Dong, Yuxiao; Ding, Ming; Tang, Jie

People are spending an enormous amount of time on digital devices through graphical user interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such as ChatGPT can assist people in tasks like writing emails, but struggl

External link: http://arxiv.org/abs/2312.08914


CogVLM: Visual Expert for Pretrained Language Models

Authors: Wang, Weihan; Lv, Qingsong; Yu, Wenmeng; Hong, Wenyi; Qi, Ji; Wang, Yan; Ji, Junhui; Yang, Zhuoyi; Zhao, Lei; Song, Xixuan; Xu, Jiazheng; Xu, Bin; Li, Juanzi; Dong, Yuxiao; Ding, Ming; Tang, Jie

We introduce CogVLM, a powerful open-source visual language foundation model. Different from the popular shallow alignment method which maps image features into the input space of language model, CogVLM bridges the gap between the frozen pretrained l

External link: http://arxiv.org/abs/2311.03079


Relay Diffusion: Unifying diffusion process across resolutions for image synthesis

Authors: Teng, Jiayan; Zheng, Wendi; Ding, Ming; Hong, Wenyi; Wangni, Jianqiao; Yang, Zhuoyi; Tang, Jie

Diffusion models achieved great success in image synthesis, but still face challenges in high-resolution generation. Through the lens of discrete cosine transformation, we find the main reason is that the same noise level on a higher resolution

External link: http://arxiv.org/abs/2309.03350
