Showing 1 - 4 of 4 for search: '"Mohammed, Owais Khan"'
Author:
Huang, Qiuyuan, Park, Jae Sung, Gupta, Abhinav, Bennett, Paul, Gong, Ran, Som, Subhojit, Peng, Baolin, Mohammed, Owais Khan, Pal, Chris, Choi, Yejin, Gao, Jianfeng
Despite the growing adoption of mixed reality and interactive AI agents, it remains challenging for these systems to generate high-quality 2D/3D scenes in unseen environments. The common practice requires deploying an AI agent to collect a large amount …
External link:
http://arxiv.org/abs/2305.00970
Author:
Huang, Shaohan, Dong, Li, Wang, Wenhui, Hao, Yaru, Singhal, Saksham, Ma, Shuming, Lv, Tengchao, Cui, Lei, Mohammed, Owais Khan, Patra, Barun, Liu, Qiang, Aggarwal, Kriti, Chi, Zewen, Bjorck, Johan, Chaudhary, Vishrav, Som, Subhojit, Song, Xia, Wei, Furu
A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, …
External link:
http://arxiv.org/abs/2302.14045
Author:
Wang, Wenhui, Bao, Hangbo, Dong, Li, Bjorck, Johan, Peng, Zhiliang, Liu, Qiang, Aggarwal, Kriti, Mohammed, Owais Khan, Singhal, Saksham, Som, Subhojit, Wei, Furu
A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce a general-purpose multimodal foundation model BEiT-3, which achieves state-of-the-art transfer performance on both vision and vision-language tasks …
External link:
http://arxiv.org/abs/2208.10442
Author:
Bao, Hangbo, Wang, Wenhui, Dong, Li, Liu, Qiang, Mohammed, Owais Khan, Aggarwal, Kriti, Som, Subhojit, Wei, Furu
We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and a fusion encoder with a modular Transformer network. Specifically, we introduce the Mixture-of-Modality-Experts (MoME) Transformer, where each block contains …
External link:
http://arxiv.org/abs/2111.02358
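The MoME idea sketched in the VLMo abstract above (each Transformer block shares self-attention but routes tokens to a modality-specific feed-forward expert) can be illustrated with a minimal NumPy sketch. This is a simplified illustration under stated assumptions, not the paper's implementation: single-head attention, ReLU experts, no LayerNorm, and the modality names `"vision"`, `"language"`, and `"vl"` are placeholders.

```python
import numpy as np


class MoMEBlock:
    """Minimal sketch of a Mixture-of-Modality-Experts block.

    Assumption-laden simplification of VLMo's design: one shared
    single-head self-attention, followed by a feed-forward expert
    chosen by the input's modality. The real model uses multi-head
    attention, LayerNorm, and learned parameters.
    """

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.wq = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.wk = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.wv = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        # One feed-forward expert per modality; attention is shared.
        self.experts = {
            m: (rng.standard_normal((dim, 4 * dim)) / np.sqrt(dim),
                rng.standard_normal((4 * dim, dim)) / np.sqrt(4 * dim))
            for m in ("vision", "language", "vl")
        }

    def __call__(self, x, modality):
        # Shared self-attention over the token sequence (rows of x).
        q, k, v = x @ self.wq, x @ self.wk, x @ self.wv
        scores = q @ k.T / np.sqrt(x.shape[-1])
        attn = np.exp(scores - scores.max(-1, keepdims=True))
        attn /= attn.sum(-1, keepdims=True)
        h = x + attn @ v
        # Modality-specific expert FFN with a residual connection.
        w1, w2 = self.experts[modality]
        return h + np.maximum(h @ w1, 0.0) @ w2


block = MoMEBlock(dim=8)
tokens = np.random.default_rng(1).standard_normal((4, 8))
vision_out = block(tokens, "vision")      # routed through the vision expert
language_out = block(tokens, "language")  # same attention, different expert
```

Because attention is shared but the experts are not, the same tokens produce different outputs depending on the declared modality, which is the routing behavior the abstract describes.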