Showing 1 - 10 of 23 for search: '"Lu, Yumao"'
Author:
Xiao, Bin, Wu, Haiping, Xu, Weijian, Dai, Xiyang, Hu, Houdong, Lu, Yumao, Zeng, Michael, Liu, Ce, Yuan, Lu
We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. While existing large vision models excel in transfer learning, they struggle to perform a…
External link:
http://arxiv.org/abs/2311.06242
Author:
Lin, Kevin, Ahmed, Faisal, Li, Linjie, Lin, Chung-Ching, Azarnasab, Ehsan, Yang, Zhengyuan, Wang, Jianfeng, Liang, Lin, Liu, Zicheng, Lu, Yumao, Liu, Ce, Wang, Lijuan
We present MM-VID, an integrated system that harnesses the capabilities of GPT-4V, combined with specialized tools in vision, audio, and speech, to facilitate advanced video understanding. MM-VID is designed to address the challenges posed by long-form…
External link:
http://arxiv.org/abs/2310.19773
Author:
Lin, Kevin, Li, Linjie, Lin, Chung-Ching, Ahmed, Faisal, Gan, Zhe, Liu, Zicheng, Lu, Yumao, Wang, Lijuan
The canonical approach to video captioning dictates a caption generation model to learn from offline-extracted dense video features. These feature extractors usually operate on video frames sampled at a fixed frame rate and are often trained on image…
External link:
http://arxiv.org/abs/2111.13196
Author:
Hu, Xiaowei, Gan, Zhe, Wang, Jianfeng, Yang, Zhengyuan, Liu, Zicheng, Lu, Yumao, Wang, Lijuan
In recent years, we have witnessed a significant performance boost in the image captioning task based on vision-language pre-training (VLP). Scale is believed to be an important factor for this advance. However, most existing work only focuses on pre-training…
External link:
http://arxiv.org/abs/2111.12233
Author:
Yang, Zhengyuan, Gan, Zhe, Wang, Jianfeng, Hu, Xiaowei, Ahmed, Faisal, Liu, Zicheng, Lu, Yumao, Wang, Lijuan
We propose UniTAB that Unifies Text And Box outputs for grounded vision-language (VL) modeling. Grounded VL tasks such as grounded captioning require the model to generate a text description and align predicted words with object regions. To achieve this…
External link:
http://arxiv.org/abs/2111.12085
Author:
Yuan, Lu, Chen, Dongdong, Chen, Yi-Ling, Codella, Noel, Dai, Xiyang, Gao, Jianfeng, Hu, Houdong, Huang, Xuedong, Li, Boxin, Li, Chunyuan, Liu, Ce, Liu, Mengchen, Liu, Zicheng, Lu, Yumao, Shi, Yu, Wang, Lijuan, Wang, Jianfeng, Xiao, Bin, Xiao, Zhen, Yang, Jianwei, Zeng, Michael, Zhou, Luowei, Zhang, Pengchuan
Automated visual understanding of our diverse and open world demands computer vision models to generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on diverse, large-scale…
External link:
http://arxiv.org/abs/2111.11432
Author:
Wang, Jianfeng, Hu, Xiaowei, Gan, Zhe, Yang, Zhengyuan, Dai, Xiyang, Liu, Zicheng, Lu, Yumao, Wang, Lijuan
In this paper, we propose a single UniFied transfOrmer (UFO), which is capable of processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the concatenation of the image and the question), for vision-language (VL) representation…
External link:
http://arxiv.org/abs/2111.10023
Author:
Yang, Zhengyuan, Gan, Zhe, Wang, Jianfeng, Hu, Xiaowei, Lu, Yumao, Liu, Zicheng, Wang, Lijuan
Knowledge-based visual question answering (VQA) involves answering questions that require external knowledge not present in the image. Existing methods first retrieve knowledge from external resources, then reason over the selected knowledge, the input…
External link:
http://arxiv.org/abs/2109.05014
Author:
Lu, Yumao
Published in:
Restricted to subscribing institutions.
Thesis (Ph. D.)--UCLA, 2005.
Vita. Includes bibliographical references (leaves 122-130).
Externí odkaz:
http://uclibs.org/PID/11984
Published in:
Proceedings of the 18th ACM Conference: Information & Knowledge Management; 11/2/2009, p1585-1588, 4p