Showing 1 - 10 of 131 for search: '"Huang, Shijia"'
Video Large Language Models (Video LLMs) have achieved impressive performance on video-and-language tasks, such as video question answering. However, most existing Video LLMs neglect temporal information in video data, leading to struggles with tempo…
External link:
http://arxiv.org/abs/2410.05714
Author:
Zhang, Hao, Li, Hongyang, Li, Feng, Ren, Tianhe, Zou, Xueyan, Liu, Shilong, Huang, Shijia, Gao, Jianfeng, Zhang, Lei, Li, Chunyuan, Yang, Jianwei
With the recent significant advancements in large multi-modal models (LMMs), the importance of their grounding capability in visual chat is increasingly recognized. Despite recent efforts to enable LMMs to support grounding, their capabilities for gr…
External link:
http://arxiv.org/abs/2312.02949
Building a generalist agent that can interact with the world is an intriguing goal for AI systems, spurring research on embodied navigation, where an agent is required to navigate according to instructions or respond to queries. Despite t…
External link:
http://arxiv.org/abs/2312.02010
Author:
Li, Yanyang, Zhao, Jianqiao, Zheng, Duo, Hu, Zi-Yuan, Chen, Zhi, Su, Xiaohui, Huang, Yongfeng, Huang, Shijia, Lin, Dahua, Lyu, Michael R., Wang, Liwei
With the continuous emergence of Chinese Large Language Models (LLMs), how to evaluate a model's capabilities has become an increasingly significant issue. The absence of a comprehensive Chinese benchmark that thoroughly assesses a model's performanc…
External link:
http://arxiv.org/abs/2308.04813
We present a mask-piloted Transformer which improves masked-attention in Mask2Former for image segmentation. The improvement is based on our observation that Mask2Former suffers from inconsistent mask predictions between consecutive decoder layers, w…
External link:
http://arxiv.org/abs/2303.07336
Author:
Liu, Shilong, Liang, Yaoyuan, Li, Feng, Huang, Shijia, Zhang, Hao, Su, Hang, Zhu, Jun, Zhang, Lei
In this paper, we study the problem of visual grounding by considering both phrase extraction and grounding (PEG). In contrast to the previous phrase-known-at-test setting, PEG requires a model to extract phrases from text and locate objects from ima…
External link:
http://arxiv.org/abs/2211.15516
Referring Expression Segmentation (RES) and Referring Expression Generation (REG) are mutually inverse tasks that can naturally be trained jointly. Though recent work has explored such joint training, the mechanism by which RES and REG benefit each other…
External link:
http://arxiv.org/abs/2211.07919
Camera-based 3D object detectors are attractive because they can be deployed more widely and at lower cost than LiDAR sensors. We first revisit the prior stereo detector DSGN and the way it constructs stereo volumes to represent both 3D geometry and semantics. We…
External link:
http://arxiv.org/abs/2204.03039
The 3D visual grounding task aims to ground a natural language description to the targeted object in a 3D scene, which is usually represented as a 3D point cloud. Previous works studied visual grounding under specific views. The vision-language corres…
External link:
http://arxiv.org/abs/2204.02174
Author:
Ma, Xinghua, Ligan, Caryl, Huang, Shijia, Chen, Yirong, Li, Muxin, Cao, Yuanyuan, Zhao, Wei, Zhao, Shuli
Published in:
Immunobiology, September 2024, 229(5)