Showing 1 - 10 of 43 for search: '"Tao, Yunzhe"'
Author:
Wang, Xuwu, Cui, Qiwen, Tao, Yunzhe, Wang, Yiran, Chai, Ziwei, Han, Xiaotian, Liu, Boyi, Yuan, Jianbo, Su, Jing, Wang, Guoyin, Liu, Tingkai, Chen, Liyu, Liu, Tianyi, Sun, Tao, Zhang, Yufeng, Zheng, Sirui, You, Quanzeng, Yang, Yang, Yang, Hongxia
Large language models (LLMs) have become increasingly pivotal across various domains, especially in handling complex data types. This includes structured data processing, as exemplified by ChartQA and ChatGPT-Ada, and multimodal unstructured data pro…
External link:
http://arxiv.org/abs/2410.00773
Author:
Fan, Qihang, You, Quanzeng, Han, Xiaotian, Liu, Yongfei, Tao, Yunzhe, Huang, Huaibo, He, Ran, Yang, Hongxia
This paper tackles a significant challenge faced by Vision Transformers (ViTs): their constrained scalability across different image resolutions. Typically, ViTs experience a performance decline when processing resolutions different from those seen d…
External link:
http://arxiv.org/abs/2403.18361
Author:
Zhang, Yufeng, Chen, Liyu, Liu, Boyi, Yang, Yingxiang, Cui, Qiwen, Tao, Yunzhe, Yang, Hongxia
Recent advances in reinforcement learning (RL) algorithms aim to enhance the performance of language models at scale. Yet, there is a noticeable absence of a cost-effective and standardized testbed tailored to evaluating and comparing these algorithm…
External link:
http://arxiv.org/abs/2403.07191
Author:
Liu, Haogeng, You, Quanzeng, Han, Xiaotian, Wang, Yiqi, Zhai, Bohan, Liu, Yongfei, Tao, Yunzhe, Huang, Huaibo, He, Ran, Yang, Hongxia
Multimodal Large Language Models (MLLMs) have experienced significant advancements recently. Nevertheless, challenges persist in the accurate recognition and comprehension of intricate details within high-resolution images. Despite being indispensabl…
External link:
http://arxiv.org/abs/2403.01487
Author:
Liu, Tingkai, Tao, Yunzhe, Liu, Haogeng, Fan, Qihang, Zhou, Ding, Huang, Huaibo, He, Ran, Yang, Hongxia
We present a novel human-annotated dataset for evaluating the ability of visual-language models to generate both short and long descriptions for real-world video clips, termed DeVAn (Dense Video Annotation). The dataset contains 8.5K YouTube video c…
External link:
http://arxiv.org/abs/2310.05060
Author:
Liu, Haogeng, Fan, Qihang, Liu, Tingkai, Yang, Linjie, Tao, Yunzhe, Huang, Huaibo, He, Ran, Yang, Hongxia
This paper proposes Video-Teller, a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment to significantly enhance the video-to-text generation task. Video-Teller boosts the training efficiency by utili…
External link:
http://arxiv.org/abs/2310.04991
In this paper, we introduce $\text{EVL}_{\text{Gen}}$, a streamlined framework designed for the pre-training of visually conditioned language generation models with high computational demands, utilizing frozen pre-trained large language models (LLMs)…
External link:
http://arxiv.org/abs/2310.03291
Program synthesis aims to create accurate, executable programs from problem specifications, specifically from natural language descriptions in our context. Recent studies have leveraged the power of reinforcement learning (RL) in conjunction with lar…
External link:
http://arxiv.org/abs/2310.03173
Author:
Yang, Liangwei, Wang, Shengjie, Tao, Yunzhe, Sun, Jiankai, Liu, Xiaolong, Yu, Philip S., Wang, Taiqing
Graph Neural Network (GNN) based recommender systems have been attracting more and more attention in recent years due to their excellent accuracy. Representing user-item interactions as a bipartite graph, a GNN model generates user and…
External link:
http://arxiv.org/abs/2211.10486
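The snippet above describes the general bipartite-graph setup used by GNN recommenders. As a hypothetical illustration only (not the method of the linked paper), the degree-normalized message passing such models typically use can be sketched in a few lines of NumPy; `propagate` and `score` are made-up names for this sketch.

```python
import numpy as np

def propagate(R, user_emb, item_emb):
    """One round of message passing on a user-item bipartite graph.

    R is the binary interaction matrix (users x items). Each side's new
    embedding is the degree-normalized aggregate of its neighbors'
    embeddings, in the style of simplified graph-convolution recommenders.
    """
    d_u = R.sum(axis=1, keepdims=True)              # user degrees, (U, 1)
    d_i = R.sum(axis=0, keepdims=True)              # item degrees, (1, I)
    norm = R / np.sqrt(np.clip(d_u, 1, None) * np.clip(d_i, 1, None))
    new_user = norm @ item_emb                      # users aggregate items
    new_item = norm.T @ user_emb                    # items aggregate users
    return new_user, new_item

def score(user_emb, item_emb):
    """Recommendation scores as inner products of user and item embeddings."""
    return user_emb @ item_emb.T
```

Stacking several `propagate` rounds lets embeddings absorb multi-hop collaborative signals before scoring.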
Most existing neural-network-based task-oriented dialogue systems follow the encoder-decoder paradigm, where the decoder depends purely on the source texts to generate a sequence of words, usually suffering from instability and poor readability. Inspired…
External link:
http://arxiv.org/abs/2106.05830
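The encoder-decoder dependence that the snippet above criticizes can be sketched with a toy model. This is a hypothetical illustration, not the linked paper's system: the decoder conditions only on an encoded context vector and its own previous token, with no external scaffolding; all names (`encode`, `decode`, `E`, `W_out`) are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, DIM = 16, 8
E = rng.normal(size=(VOCAB, DIM))       # shared token embedding table
W_out = rng.normal(size=(DIM, VOCAB))   # decoder projection to vocabulary

def encode(src_ids):
    """Compress the source token ids into a single context vector (mean pool)."""
    return E[src_ids].mean(axis=0)

def decode(context, start_id=0, max_len=5):
    """Greedy decoding: each step sees only the context and the previous token."""
    out, prev = [], start_id
    for _ in range(max_len):
        h = context + E[prev]                 # decoder state
        prev = int(np.argmax(h @ W_out))      # pick the highest-scoring token
        out.append(prev)
    return out
```

Because every generated token is a function of the source encoding alone, any noise in that encoding propagates through the whole output, which is one way to read the "instability" the abstract mentions.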