Showing 1 - 10 of 239 for search: '"Liu Haowei"'
We introduce a novel framework for representation learning in head pose estimation (HPE). Previously such a scheme was difficult due to head pose data sparsity, making triplet sampling infeasible. Recent progress in 3D generative adversarial networks…
External link:
http://arxiv.org/abs/2412.02066
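For context on the triplet-sampling scheme the abstract alludes to, a minimal sketch follows. It shows only the standard triplet margin loss over pose embeddings; the embedding size, margin, and batch shapes are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of triplet-based representation learning for head pose embeddings.
# The encoder, margin, and tensor shapes below are assumptions for illustration.
import torch
import torch.nn.functional as F

def triplet_loss(anchor_emb, positive_emb, negative_emb, margin=0.2):
    """Standard triplet margin loss: pull the positive (similar head pose)
    toward the anchor and push the negative (dissimilar pose) away."""
    pos_dist = F.pairwise_distance(anchor_emb, positive_emb)
    neg_dist = F.pairwise_distance(anchor_emb, negative_emb)
    return torch.clamp(pos_dist - neg_dist + margin, min=0).mean()

# Hypothetical usage: embeddings produced by some encoder over face crops whose
# pose labels determine which samples count as positives and negatives.
anchor = torch.randn(8, 128)
positive = torch.randn(8, 128)
negative = torch.randn(8, 128)
loss = triplet_loss(anchor, positive, negative)
```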
Author:
Ye, Jiabo, Xu, Haiyang, Liu, Haowei, Hu, Anwen, Yan, Ming, Qian, Qi, Zhang, Ji, Huang, Fei, Zhou, Jingren
Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we…
External link:
http://arxiv.org/abs/2408.04840
Many head pose estimation (HPE) methods promise the ability to create full-range datasets, theoretically allowing the estimation of the rotation and positioning of the head from various angles. However, these methods are only accurate within a range…
External link:
http://arxiv.org/abs/2408.01566
Author:
Liu, Haowei, Zhang, Xi, Xu, Haiyang, Shi, Yaya, Jiang, Chaoya, Yan, Ming, Zhang, Ji, Huang, Fei, Yuan, Chunfeng, Li, Bing, Hu, Weiming
Built on the power of LLMs, numerous multimodal large language models (MLLMs) have recently achieved remarkable performance on various vision-language tasks. However, most existing MLLMs and benchmarks primarily focus on single-image input scenarios…
External link:
http://arxiv.org/abs/2407.15272
Author:
Liu, Haowei, Shi, Yaya, Xu, Haiyang, Yuan, Chunfeng, Ye, Qinghao, Li, Chenliang, Yan, Ming, Zhang, Ji, Huang, Fei, Li, Bing, Hu, Weiming
In vision-language pre-training (VLP), masked image modeling (MIM) has recently been introduced for fine-grained cross-modal alignment. However, in most existing methods, the reconstruction targets for MIM lack high-level semantics, and text is not…
External link:
http://arxiv.org/abs/2403.00249
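As a reference for the masked image modeling (MIM) setup the abstract discusses, here is a minimal generic sketch: a fraction of patch tokens is hidden and a reconstruction loss is computed only on the masked positions. The mask ratio, tensor shapes, and squared-error loss are illustrative assumptions and do not reflect the paper's semantics-aware targets.

```python
# Generic masked image modeling (MIM) setup for reference; the mask ratio and
# loss choice below are assumptions for illustration, not the paper's design.
import torch

def random_patch_mask(num_patches: int, mask_ratio: float = 0.4) -> torch.Tensor:
    """Return a boolean mask selecting which patch tokens to hide."""
    num_masked = int(num_patches * mask_ratio)
    perm = torch.randperm(num_patches)
    mask = torch.zeros(num_patches, dtype=torch.bool)
    mask[perm[:num_masked]] = True
    return mask

def mim_loss(pred_tokens: torch.Tensor, target_tokens: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Reconstruction loss on masked positions only, as in standard MIM.
    The choice of target (raw pixels vs. higher-level features) is exactly the
    point the abstract raises for cross-modal alignment."""
    return ((pred_tokens[mask] - target_tokens[mask]) ** 2).mean()

# Hypothetical usage with 196 patch tokens of dimension 768.
mask = random_patch_mask(196)
loss = mim_loss(torch.randn(196, 768), torch.randn(196, 768), mask)
```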
Author:
Liu, Haowei, Shi, Yaya, Xu, Haiyang, Yuan, Chunfeng, Ye, Qinghao, Li, Chenliang, Yan, Ming, Zhang, Ji, Huang, Fei, Li, Bing, Hu, Weiming
In video-text retrieval, most existing methods adopt the dual-encoder architecture for fast retrieval, which employs two individual encoders to extract global latent representations for videos and texts. However, they face challenges in capturing fine-grained…
External link:
http://arxiv.org/abs/2402.16769
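The dual-encoder retrieval scheme the abstract describes can be summarized with a short sketch: two independent encoders produce global embeddings, and retrieval reduces to a cosine-similarity lookup. The precomputed embeddings and their sizes are assumptions for illustration only, not the paper's models.

```python
# Minimal dual-encoder retrieval sketch; encoders are assumed to have already
# produced one global embedding per video and per caption.
import torch
import torch.nn.functional as F

def retrieval_scores(video_embs: torch.Tensor, text_embs: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between every (video, text) pair; higher means a better match."""
    v = F.normalize(video_embs, dim=-1)
    t = F.normalize(text_embs, dim=-1)
    return v @ t.T  # shape: (num_videos, num_texts)

# Hypothetical usage with precomputed global embeddings.
video_embs = torch.randn(100, 512)   # one embedding per video
text_embs = torch.randn(100, 512)    # one embedding per caption
scores = retrieval_scores(video_embs, text_embs)
best_text_per_video = scores.argmax(dim=1)
```

This global-matching formulation is what makes dual encoders fast, and it is also why, as the abstract notes, they can struggle with fine-grained video-text correspondence.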
The ability of Large Language Models (LLMs) to critique and refine their reasoning is crucial for their application in evaluation, feedback provision, and self-improvement. This paper introduces CriticBench, a comprehensive benchmark designed to assess…
External link:
http://arxiv.org/abs/2402.14809
Author:
Wang, Shuxun, Lei, Yunfei, Zhang, Ziqi, Liu, Wei, Liu, Haowei, Yang, Li, Li, Wenjuan, Li, Bing, Hu, Weiming
With the rise of "Metaverse" and "Web 3.0", the Non-Fungible Token (NFT) has emerged as a pivotal kind of digital asset, garnering significant attention. By the end of March 2024, more than 1.7 billion NFTs have been minted across various blockchain platforms…
External link:
http://arxiv.org/abs/2402.16872
Author:
Ye, Qinghao, Xu, Haiyang, Ye, Jiabo, Yan, Ming, Hu, Anwen, Liu, Haowei, Qian, Qi, Zhang, Ji, Huang, Fei, Zhou, Jingren
Multi-modal Large Language Models (MLLMs) have demonstrated impressive instruction abilities across various open-ended tasks. However, previous methods primarily focus on enhancing multi-modal capabilities. In this work, we introduce a versatile multi-modal…
External link:
http://arxiv.org/abs/2311.04257