Showing 1 - 10 of 131 for search: '"Huang, Shijia"'
Video Large Language Models (Video LLMs) have achieved impressive performance on video-and-language tasks, such as video question answering. However, most existing Video LLMs neglect temporal information in video data, leading to struggles with tempo…
External link:
http://arxiv.org/abs/2410.05714
Author:
Zhang, Hao, Li, Hongyang, Li, Feng, Ren, Tianhe, Zou, Xueyan, Liu, Shilong, Huang, Shijia, Gao, Jianfeng, Zhang, Lei, Li, Chunyuan, Yang, Jianwei
With the recent significant advancements in large multi-modal models (LMMs), the importance of their grounding capability in visual chat is increasingly recognized. Despite recent efforts to enable LMMs to support grounding, their capabilities for gr…
External link:
http://arxiv.org/abs/2312.02949
Building a generalist agent that can interact with the world is an intriguing goal for AI systems, spurring research on embodied navigation, where an agent is required to navigate according to instructions or respond to queries. Despite t…
External link:
http://arxiv.org/abs/2312.02010
Author:
Li, Yanyang, Zhao, Jianqiao, Zheng, Duo, Hu, Zi-Yuan, Chen, Zhi, Su, Xiaohui, Huang, Yongfeng, Huang, Shijia, Lin, Dahua, Lyu, Michael R., Wang, Liwei
With the continuous emergence of Chinese Large Language Models (LLMs), how to evaluate a model's capabilities has become an increasingly significant issue. The absence of a comprehensive Chinese benchmark that thoroughly assesses a model's performanc…
External link:
http://arxiv.org/abs/2308.04813
We present a mask-piloted Transformer which improves masked-attention in Mask2Former for image segmentation. The improvement is based on our observation that Mask2Former suffers from inconsistent mask predictions between consecutive decoder layers, w…
External link:
http://arxiv.org/abs/2303.07336
Author:
Liu, Shilong, Liang, Yaoyuan, Li, Feng, Huang, Shijia, Zhang, Hao, Su, Hang, Zhu, Jun, Zhang, Lei
In this paper, we study the problem of visual grounding by considering both phrase extraction and grounding (PEG). In contrast to the previous phrase-known-at-test setting, PEG requires a model to extract phrases from text and locate objects from ima…
External link:
http://arxiv.org/abs/2211.15516
Referring Expression Segmentation (RES) and Referring Expression Generation (REG) are mutually inverse tasks that can naturally be trained jointly. Though recent work has explored such joint training, the mechanism by which RES and REG benefit each other…
External link:
http://arxiv.org/abs/2211.07919
Camera-based 3D object detectors are attractive because they can be deployed more widely and at lower cost than LiDAR sensors. We first revisit the prior stereo detector DSGN and the way it constructs stereo volumes to represent both 3D geometry and semantics. We…
External link:
http://arxiv.org/abs/2204.03039
The 3D visual grounding task aims to ground a natural language description to the targeted object in a 3D scene, which is usually represented as a 3D point cloud. Previous works studied visual grounding under specific views. The vision-language corres…
External link:
http://arxiv.org/abs/2204.02174
Author:
Ma, Xinghua, Ligan, Caryl, Huang, Shijia, Chen, Yirong, Li, Muxin, Cao, Yuanyuan, Zhao, Wei, Zhao, Shuli
Published in:
Immunobiology, September 2024, 229(5)