Showing 1 - 10 of 241 results for search: '"Xue,Hongwei"'
Author:
Ma, Feipeng, Xue, Hongwei, Wang, Guangting, Zhou, Yizhou, Rao, Fengyun, Yan, Shilin, Zhang, Yueyi, Wu, Siying, Shou, Mike Zheng, Sun, Xiaoyan
Existing Multimodal Large Language Models (MLLMs) follow the paradigm of perceiving visual information by aligning visual features with the input space of Large Language Models (LLMs) and concatenating visual tokens with text tokens to form a unified …
External link:
http://arxiv.org/abs/2405.20339
Author:
Ma, Feipeng, Xue, Hongwei, Wang, Guangting, Zhou, Yizhou, Rao, Fengyun, Yan, Shilin, Zhang, Yueyi, Wu, Siying, Shou, Mike Zheng, Sun, Xiaoyan
Most multi-modal tasks can be formulated into problems of either generation or embedding. Existing models usually tackle these two types of problems by decoupling language modules into a text decoder for generation, and a text encoder for embedding.
External link:
http://arxiv.org/abs/2405.19333
Masked Autoencoders (MAE) have been a prevailing paradigm for large-scale vision representation pre-training. By reconstructing masked image patches from a small portion of visible image regions, MAE forces the model to infer semantic correlation …
External link:
http://arxiv.org/abs/2211.08887
Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks. Previous studies of video-language pre-training mainly focus on short-form videos (i.e., within 30 seconds) and sentences, leaving long-form …
External link:
http://arxiv.org/abs/2210.06031
The pre-trained image-text models, like CLIP, have demonstrated the strong power of vision-language representation learned from large-scale web-collected image-text data. In light of the well-learned visual features, some existing works transfer …
External link:
http://arxiv.org/abs/2209.06430
Published in:
Sensors and Actuators B: Chemical, 1 November 2024, 418
Author:
Xue, Hongwei, Hang, Tiankai, Zeng, Yanhong, Sun, Yuchong, Liu, Bei, Yang, Huan, Fu, Jianlong, Guo, Baining
Published in:
CVPR 2022
We study joint video and language (VL) pre-training to enable cross-modality learning and benefit plentiful downstream VL tasks. Existing works either extract low-quality video features or learn limited text embeddings, while neglecting that high-resolution …
External link:
http://arxiv.org/abs/2111.10337
We study the joint learning of image-to-text and text-to-image generations, which are naturally bi-directional tasks. Typical existing works design two separate task-specific models for each task, which impose expensive design efforts. In this work, …
External link:
http://arxiv.org/abs/2110.09753
Author:
Zinzani, Pier Luigi, Wang, Huaqing, Feng, Jifeng, Kim, Tae Min, Tao, Rong, Zhang, Huilai, Fogliatto, Laura, Maluquer Artigal, Clara, Özcan, Muhit, Yanez, Eduardo, Kim, Won Seog, Kirtbaya, Dmitry, Kriachok, Iryna, Maciel, Felipe, Xue, Hongwei, Bouabdallah, Krimo, Phelps, Charles, Chaturvedi, Shalini, Weispfenning, Anke, Morcos, Peter N., Odongo, Fatuma, Buvaylo, Viktoriya, Childs, Barrett H., Dreyling, Martin, Matasar, Matthew, Ghione, Paola
Published in:
Blood Advances, 24 September 2024, 8(18):4866-4876
In this paper we focus on landscape animation, which aims to generate time-lapse videos from a single landscape image. Motion is crucial for landscape animation as it determines how objects move in videos. Existing methods are able to generate appealing …
External link:
http://arxiv.org/abs/2109.02216