Showing 1 - 10 of 225 results for search: '"Meng, Xiaojun"'
Multimodal Large Language Models (MLLMs) have demonstrated great zero-shot performance on visual question answering (VQA). However, when it comes to knowledge-based VQA (KB-VQA), MLLMs may lack human commonsense or specialized domain knowledge to answer…
External link:
http://arxiv.org/abs/2409.07331
Video Question Answering (VideoQA) has emerged as a challenging frontier in the field of multimedia processing, requiring intricate interactions between visual and textual modalities. Simply uniformly sampling frames or indiscriminately aggregating frames…
External link:
http://arxiv.org/abs/2407.15047
Large language models (LLMs) have attracted great attention given their strong performance on a wide range of NLP tasks. In practice, users often expect generated texts to fall within a specific length range, making length-controlled generation an important…
External link:
http://arxiv.org/abs/2406.10278
Video-text Large Language Models (video-text LLMs) have shown remarkable performance in answering questions and holding conversations on simple videos. However, they perform almost the same as random on grounding text queries in long and complicated…
External link:
http://arxiv.org/abs/2403.10228
Unsupervised extractive summarization is an important technique in information extraction and retrieval. Compared with supervised methods, it does not require high-quality human-labelled summaries for training and thus can be easily applied to documents…
External link:
http://arxiv.org/abs/2312.06901
Large language models (LLMs) like ChatGPT and GPT-4 have attracted great attention given their surprising performance on a wide range of NLP tasks. Length-controlled generation of LLMs emerges as an important topic, which enables users to fully leverage…
External link:
http://arxiv.org/abs/2308.12030
This study proposes a multitask learning architecture for extractive summarization with coherence boosting. The architecture contains an extractive summarizer and a coherent discriminator module. The coherent discriminator is trained online on the sent…
External link:
http://arxiv.org/abs/2305.12851
Multimodal abstractive summarization for videos (MAS) requires generating a concise textual summary to describe the highlights of a video according to multimodal resources, in our case, the video content and its transcript. Inspired by the success of…
External link:
http://arxiv.org/abs/2305.04824
Author:
Bai, Haoli, Liu, Zhiguang, Meng, Xiaojun, Li, Wentao, Liu, Shuang, Xie, Nian, Zheng, Rongfu, Wang, Liangwei, Hou, Lu, Wei, Jiansheng, Jiang, Xin, Liu, Qun
Unsupervised pre-training on millions of digital-born or scanned documents has shown promising advances in visual document understanding (VDU). While various vision-language pre-training objectives are studied in existing solutions, the document text…
External link:
http://arxiv.org/abs/2212.09621
Recently, semantic parsing using hierarchical representations for dialog systems has captured substantial attention. Task-Oriented Parse (TOP), a tree representation with intents and slots as labels of nested tree nodes, has been proposed for parsing…
External link:
http://arxiv.org/abs/2211.14508