Showing 1 - 10 of 9,663 for search: '"LU Hui"'
In contemporary biomedical research, the efficiency of data-driven approaches is hindered by large data volumes, tool selection complexity, and human resource limitations, necessitating the development of fully autonomous research systems …
External link:
http://arxiv.org/abs/2407.13637
Author:
Huang, Chao-Wei, Lu, Hui, Gong, Hongyu, Inaguma, Hirofumi, Kulikov, Ilia, Mavlyutov, Ruslan, Popuri, Sravya
Large language models (LLMs), known for their exceptional reasoning capabilities, generalizability, and fluency across diverse domains, present a promising avenue for enhancing speech-related tasks. In this paper, we focus on integrating decoder-only …
External link:
http://arxiv.org/abs/2407.03169
Video understanding requires the extraction of rich spatio-temporal representations, which transformer models achieve through self-attention. Unfortunately, self-attention poses a computational burden. In NLP, Mamba has surfaced as an efficient alternative …
External link:
http://arxiv.org/abs/2406.19006
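The computational burden mentioned in the snippet above comes from the attention matrix, whose size grows with the square of the sequence length. A minimal NumPy sketch makes this visible (shapes and names here are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def self_attention(x):
    """Naive scaled dot-product self-attention over a (T, d) sequence.

    The intermediate (T, T) score matrix is what makes the cost
    quadratic in sequence length T -- the burden that Mamba-style
    linear-time models aim to avoid.
    """
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)                # (T, T): quadratic in T
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                           # (T, d)

x = np.random.default_rng(0).normal(size=(6, 4))
out = self_attention(x)                          # out.shape == (6, 4)
```

Doubling T quadruples the score matrix, which is why efficient alternatives matter for long video sequences.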
Prohibited item detection in X-ray images is one of the most effective security inspection methods. However, differing from natural-light images, the unique overlapping phenomena in X-ray images lead to the coupling of foreground and background features …
External link:
http://arxiv.org/abs/2406.03176
VQ-VAE, as a mainstream approach to speech tokenization, has been troubled by "index collapse", where only a small number of codewords are activated in large codebooks. This work proposes a product-quantized (PQ) VAE with more codebooks but fewer codewords …
External link:
http://arxiv.org/abs/2406.02940
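The product-quantization idea in the snippet above — several small codebooks instead of one large one — can be sketched in a few lines. The group count, dimensions, codebook sizes, and helper names below are illustrative assumptions, not the paper's PQ-VAE implementation:

```python
import numpy as np

# Toy product quantization: split a vector into G groups, each quantized
# by its own small codebook. The effective codebook size is the *product*
# of per-group sizes, while only the *sum* of codewords must be stored.
rng = np.random.default_rng(0)

G, D, K = 4, 16, 8                           # groups, vector dim, codewords per group
codebooks = rng.normal(size=(G, K, D // G))  # one small codebook per group

def pq_encode(x):
    """Return the nearest-codeword index for each group of x."""
    groups = x.reshape(G, D // G)
    dists = np.linalg.norm(codebooks - groups[:, None, :], axis=-1)
    return dists.argmin(axis=1)              # shape (G,)

def pq_decode(indices):
    """Reassemble the vector from the chosen per-group codewords."""
    return np.concatenate([codebooks[g, i] for g, i in enumerate(indices)])

x = rng.normal(size=D)
codes = pq_encode(x)
x_hat = pq_decode(codes)

stored = G * K                               # 32 codewords kept in memory
effective = K ** G                           # 4096 distinct reconstructions
```

With every group's codewords in active use, the combinatorial capacity stays large while each small codebook is easy to keep fully utilized — one intuition for why this can mitigate index collapse.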
Owing to their ability to extract relevant spatio-temporal video embeddings, Vision Transformers (ViTs) are currently the best-performing models in video action understanding. However, their generalization over domains or datasets is somewhat limited …
External link:
http://arxiv.org/abs/2403.16128
A key challenge in continuous sign language recognition (CSLR) is to efficiently capture long-range spatial interactions over time from the video input. To address this challenge, we propose TCNet, a hybrid network that effectively models spatio-temporal …
External link:
http://arxiv.org/abs/2403.11818
Author:
Peng, Yifan, Kulikov, Ilia, Yang, Yilin, Popuri, Sravya, Lu, Hui, Wang, Changhan, Gong, Hongyu
Research interest and advances have been emerging in speech-to-speech translation (S2ST), the task of translating utterances from one language to another. This work proposes the Multitask Speech Language Model (MSLM), a decoder-only speech language model …
External link:
http://arxiv.org/abs/2403.12408
Author:
Peng, Yifan, Kulikov, Ilia, Yang, Yilin, Popuri, Sravya, Lu, Hui, Wang, Changhan, Gong, Hongyu
Speech language models (LMs) are promising for high-quality speech synthesis through in-context learning. A typical speech LM takes discrete semantic units as content and a short utterance as prompt, and synthesizes speech which preserves the content …
External link:
http://arxiv.org/abs/2403.12402
With the advent of byte-addressable memory devices, such as CXL memory, persistent memory, and storage-class memory, tiered memory systems have become a reality. Page migration is the de facto method within operating systems for managing tiered memory …
External link:
http://arxiv.org/abs/2401.13154