Showing 1 - 10 of 134 for search: '"She, Qi"'
In this report, we introduce MammothModa, yet another multi-modal large language model (MLLM) designed to achieve state-of-the-art performance starting from an elementary baseline. We focus on three key design insights: (i) Integrating Visual Capabil
External link:
http://arxiv.org/abs/2406.18193
Steerable models can provide very general and flexible equivariance by formulating equivariance requirements in the language of representation theory and feature fields, which has been recognized to be effective for many vision tasks. However, derivi
External link:
http://arxiv.org/abs/2208.03720
Deep neural networks are able to memorize noisy labels easily with a softmax cross-entropy (CE) loss. Previous studies that attempted to address this issue focus on incorporating a noise-robust loss function into the CE loss. However, the memorization issue
External link:
http://arxiv.org/abs/2203.01785
Weakly supervised object localization (WSOL) focuses on localizing objects only with the supervision of image-level classification masks. Most previous WSOL methods follow the classification activation map (CAM) that localizes objects based on the cl
External link:
http://arxiv.org/abs/2203.01714
Author:
Zhu, Lei, She, Qi, Chen, Qian, Meng, Xiangxi, Geng, Mufeng, Jin, Lujia, Jiang, Zhe, Qiu, Bin, You, Yunfei, Zhang, Yibao, Ren, Qiushi, Lu, Yanye
Weakly supervised object localization (WSOL) relaxes the requirement of dense annotations for object localization by using image-level classification masks to supervise its learning process. However, current WSOL methods suffer from excessive activat
External link:
http://arxiv.org/abs/2112.14379
Author:
Xiao, Junfei, Jing, Longlong, Zhang, Lin, He, Ju, She, Qi, Zhou, Zongwei, Yuille, Alan, Li, Yingwei
Semi-supervised video action recognition tends to enable deep neural networks to achieve remarkable performance even with very limited labeled data. However, existing methods are mainly transferred from current image-based methods (e.g., FixMatch). W
External link:
http://arxiv.org/abs/2111.13241
Most existing video action recognition models ingest raw RGB frames. However, the raw video stream requires enormous storage and contains significant temporal redundancy. Video compression (e.g., H.264, MPEG-4) reduces superfluous information by r
External link:
http://arxiv.org/abs/2110.08814
Author:
Xu, Cheng, Wang, Weimin, Liu, Shuai, Wang, Yong, Tang, Yuxiang, Bian, Tianling, Yan, Yanyu, She, Qi, Yang, Cheng
In this paper, we show our solution to the Google Landmark Recognition 2021 Competition. Firstly, embeddings of images are extracted via various architectures (i.e. CNN-, Transformer- and hybrid-based), which are optimized by ArcFace loss. Then we ap
External link:
http://arxiv.org/abs/2110.02794
Author:
Feng, Panhe, She, Qi, Zhu, Lei, Li, Jiaxin, Zhang, Lin, Feng, Zijian, Wang, Changhu, Li, Chunpeng, Kang, Xuejing, Ming, Anlong
Retrieving occlusion relations among objects in a single image is challenging due to the sparsity of boundaries in the image. We observe two key issues in existing works: firstly, the lack of an architecture which can exploit the limited amount of coupling in the
External link:
http://arxiv.org/abs/2108.05722
The nonlocal-based blocks are designed for capturing long-range spatial-temporal dependencies in computer vision tasks. Although having shown excellent performance, they still lack the mechanism to encode the rich, structured information among elemen
External link:
http://arxiv.org/abs/2108.02451