Showing 1 - 10 of 21 for search: '"Qian, Shengju"'
Author:
Liu, Lin, Liu, Quande, Qian, Shengju, Zhou, Yuan, Zhou, Wengang, Li, Houqiang, Xie, Lingxi, Tian, Qi
Video generation is a challenging yet pivotal task in various industries, such as gaming, e-commerce, and advertising. One significant unresolved aspect within text-to-video (T2V) generation is the effective visualization of text within generated videos. Despite the progress…
External link:
http://arxiv.org/abs/2406.17777
Generating high-fidelity human video with specified identities has attracted significant attention in the content generation community. However, existing techniques struggle to strike a balance between training efficiency and identity preservation…
External link:
http://arxiv.org/abs/2404.15275
Author:
Shao, Hao, Qian, Shengju, Xiao, Han, Song, Guanglu, Zong, Zhuofan, Wang, Letian, Liu, Yu, Li, Hongsheng
Multi-Modal Large Language Models (MLLMs) have demonstrated impressive performance in various VQA tasks. However, they often lack interpretability and struggle with complex visual inputs, especially when the resolution of the input image is high…
External link:
http://arxiv.org/abs/2403.16999
This study targets a critical aspect of multi-modal LLM (LLM & VLM) inference: explicit, controllable text generation. Multi-modal LLMs enable multi-modal understanding together with semantic generation, yet offer less explainability and…
External link:
http://arxiv.org/abs/2312.04302
We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs) with limited computation cost. Typically, training LLMs with long context sizes is computationally expensive, requiring…
External link:
http://arxiv.org/abs/2309.12307
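The snippet above cuts off before the method details. From the paper, LongLoRA pairs low-rank (LoRA) weight updates with a "shifted sparse attention" pattern used only during fine-tuning, while full attention is restored at inference. Below is a minimal sketch of that attention pattern under those assumptions; the function name, shapes, and helper are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def shifted_sparse_attention(q, k, v, group_size):
    """q, k, v: (batch, heads, seq_len, head_dim); seq_len % group_size == 0."""
    b, h, n, d = q.shape
    half, shift = h // 2, group_size // 2
    # Shift half of the heads by half a group so information can flow
    # between neighboring groups.
    q, k, v = (torch.cat([x[:, :half], x[:, half:].roll(-shift, dims=2)], dim=1)
               for x in (q, k, v))
    # Attend within each group only, by folding groups into a batch-like dim.
    def grouped(x):
        return x.reshape(b, h, n // group_size, group_size, d)
    out = F.scaled_dot_product_attention(grouped(q), grouped(k), grouped(v))
    out = out.reshape(b, h, n, d)
    # Undo the shift on the shifted heads.
    return torch.cat([out[:, :half], out[:, half:].roll(shift, dims=2)], dim=1)
```

The design point is that each attention call only sees `group_size` tokens, so training cost stays near that of short-context fine-tuning even when `seq_len` is large.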
Contrastive Language-Image Pre-training (CLIP) has recently shown great promise in pixel-level zero-shot learning tasks. However, existing approaches utilizing CLIP's text and patch embeddings to generate semantic masks often misidentify input pixels…
External link:
http://arxiv.org/abs/2304.07547
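The baseline recipe this snippet refers to, generating masks from CLIP's text and patch embeddings, amounts to a per-patch nearest-text classification. A minimal sketch follows; `patch_embeds` and `text_embeds` are assumed to come from a CLIP image/text encoder pair, and this shows the generic baseline the paper critiques, not its proposed correction.

```python
import torch
import torch.nn.functional as F

def patch_level_masks(patch_embeds, text_embeds, h, w):
    # patch_embeds: (h*w, d) patch tokens from the image encoder.
    # text_embeds: (num_classes, d) embeddings of class-name prompts.
    patch_embeds = F.normalize(patch_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = patch_embeds @ text_embeds.T       # cosine similarity (h*w, C)
    labels = logits.argmax(dim=-1)              # per-patch class id
    return labels.reshape(h, w)                 # coarse semantic mask
```

Upsampling the `(h, w)` label grid to the input resolution yields the kind of zero-shot mask the snippet describes.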
We propose Stratified Image Transformer (StraIT), a pure non-autoregressive (NAR) generative model that demonstrates superiority in high-quality image synthesis over existing autoregressive (AR) and diffusion models (DMs). In contrast to the under-exploited…
External link:
http://arxiv.org/abs/2303.00750
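The snippet ends before the method, so as a generic illustration only: NAR image generators of this family typically decode discrete tokens by parallel iterative refinement, starting fully masked, predicting every position at once, and committing only the most confident predictions each step. The `model` interface and the commit schedule below are assumptions for the sketch, not StraIT's actual procedure.

```python
import torch

@torch.no_grad()
def nar_decode(model, seq_len, mask_id, steps=8):
    # All positions start as [MASK]; every step refines them in parallel.
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = model(tokens)                      # (1, seq_len, vocab)
        conf, pred = logits.softmax(-1).max(-1)     # per-position confidence
        masked = tokens.eq(mask_id)
        if step == steps - 1:                       # last step: commit the rest
            tokens[masked] = pred[masked]
            break
        conf = conf.masked_fill(~masked, float("-inf"))
        k = max(1, int(masked.sum()) // (steps - step))
        idx = conf.topk(k, dim=-1).indices          # most confident masked slots
        tokens[0, idx[0]] = pred[0, idx[0]]
    return tokens
```

This is what makes NAR decoding fast relative to AR models: the whole token grid is produced in a fixed, small number of forward passes instead of one pass per token.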
The transformer architecture, which has recently seen booming applications in vision tasks, pivots away from the widespread convolutional paradigm. Relying on a tokenization process that splits inputs into multiple tokens, transformers are…
External link:
http://arxiv.org/abs/2212.11115
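The tokenization process the snippet above describes is, in the common ViT formulation, a split of the image into non-overlapping patches followed by a linear projection of each patch into a token. A minimal sketch, with illustrative hyperparameters:

```python
import torch.nn as nn

class PatchTokenizer(nn.Module):
    def __init__(self, patch=16, in_ch=3, dim=768):
        super().__init__()
        # A strided convolution cuts non-overlapping patches and projects
        # them in one step.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                    # x: (batch, 3, H, W)
        x = self.proj(x)                     # (batch, dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)  # (batch, num_tokens, dim)
```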
Pre-training has set numerous states of the art in high-level computer vision, yet few attempts have been made to investigate how pre-training acts in image processing systems. In this paper, we tailor transformer-based pre-training regimes…
External link:
http://arxiv.org/abs/2112.10175
Transformer architectures, based on a self-attention mechanism and a convolution-free design, have recently achieved superior performance and found booming applications in computer vision. However, the discontinuous patch-wise tokenization process implicitly introduces…
External link:
http://arxiv.org/abs/2110.15156