Showing 1 - 6 of 6 for search: '"Ma, Chuofan"'
Balancing training on long-tail data distributions remains a long-standing challenge in deep learning. While methods such as re-weighting and re-sampling help alleviate the imbalance issue, limited sample diversity continues to hinder models from lea
External link:
http://arxiv.org/abs/2410.15980
We introduce Groma, a Multimodal Large Language Model (MLLM) with grounded and fine-grained visual perception ability. Beyond holistic image understanding, Groma is adept at region-level tasks such as region captioning and visual grounding. Such capa
External link:
http://arxiv.org/abs/2404.13013
Understanding the semantics of individual regions or patches of unconstrained images, such as open-world object detection, remains a critical yet challenging task in computer vision. Building on the success of powerful image-level vision-language (Vi
External link:
http://arxiv.org/abs/2311.01373
Deriving reliable region-word alignment from image-text pairs is critical to learn object-level vision-language representations for open-vocabulary object detection. Existing methods typically rely on pre-trained or self-trained vision-language model
External link:
http://arxiv.org/abs/2310.16667
Learning image classification and image generation using the same set of network parameters is a challenging problem. Recent advanced approaches that perform well in one task often exhibit poor performance in the other. This work introduces an energy-base
External link:
http://arxiv.org/abs/2304.02012
In this paper, we empirically study how to make the most of low-resolution frames for efficient video recognition. Existing methods mainly focus on developing compact networks or alleviating temporal redundancy of video inputs to increase efficiency,
External link:
http://arxiv.org/abs/2209.12797