Showing 1 - 10 of 87 for search: '"Huang po-yao"'
Author:
Xu, Hu, Huang, Po-Yao, Tan, Xiaoqing Ellen, Yeh, Ching-Feng, Kahn, Jacob, Jou, Christine, Ghosh, Gargi, Levy, Omer, Zettlemoyer, Luke, Yih, Wen-tau, Li, Shang-Wen, Xie, Saining, Feichtenhofer, Christoph
This paper focuses on creating synthetic data to improve the quality of image captions. Existing works typically have two shortcomings: first, they caption images from scratch, ignoring existing alt-text metadata; and second, they lack transparency if the…
External link:
http://arxiv.org/abs/2410.17251
Speech sounds convey a great deal of information about the scene, resulting in a variety of effects ranging from reverberation to additional ambient sounds. In this paper, we manipulate input speech to sound as though it was recorded within a differ…
External link:
http://arxiv.org/abs/2409.14340
Author:
Sharma, Vasu, Padthe, Karthik, Ardalani, Newsha, Tirumala, Kushal, Howes, Russell, Xu, Hu, Huang, Po-Yao, Li, Shang-Wen, Aghajanyan, Armen, Ghosh, Gargi, Zettlemoyer, Luke
In recent times, training Language Models (LMs) has relied on computationally heavy training over massive datasets, which makes the training process extremely laborious. In this paper, we propose a novel method for numerically evaluating text quality…
External link:
http://arxiv.org/abs/2405.01582
Author:
Ma, Jiawei, Huang, Po-Yao, Xie, Saining, Li, Shang-Wen, Zettlemoyer, Luke, Chang, Shih-Fu, Yih, Wen-Tau, Xu, Hu
The success of contrastive language-image pretraining (CLIP) relies on the supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP dat…
External link:
http://arxiv.org/abs/2404.16030
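The pairing supervision this abstract refers to is CLIP's symmetric contrastive objective: matched image–caption pairs sit on the diagonal of a similarity matrix and are pushed apart from all mismatched pairs. A minimal NumPy sketch of that standard loss follows; the function name and the temperature value are illustrative assumptions, not code from the paper:

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_emb, text_emb: (batch, dim) arrays where row i of each
    array corresponds to the same image-caption pair.
    """
    # Normalize embeddings so dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # Pairwise similarity matrix, scaled by temperature
    logits = image_emb @ text_emb.T / temperature
    n = logits.shape[0]
    # Matched pairs lie on the diagonal; cross-entropy in both directions
    log_sm_i2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_sm_t2i = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    return -(np.trace(log_sm_i2t) + np.trace(log_sm_t2i)) / (2 * n)
```

With perfectly aligned, mutually orthogonal pairs the loss approaches zero; noisy web-crawled pairs inflate the off-diagonal similarities, which is the failure mode MoDE's data experts are designed to mitigate.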
We introduce VoiceCraft, a token-infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on audiobooks, internet videos, and podcasts. VoiceCraft employs a Transforme…
External link:
http://arxiv.org/abs/2403.16973
We study the problem of unsupervised domain adaptation for egocentric videos. We propose a transformer-based model to learn class-discriminative and domain-invariant feature representations. It consists of two novel designs. The first module is calle…
External link:
http://arxiv.org/abs/2403.16242
We propose Fast Language-Audio Pre-training (FLAP), a self-supervised approach that efficiently and effectively learns aligned audio and language representations through masking, contrastive learning and reconstruction. For efficiency, FLAP randomly…
External link:
http://arxiv.org/abs/2311.01615
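The masking step this abstract mentions, randomly dropping a large fraction of patch tokens before they reach the encoder, is the usual source of efficiency in masked pretraining. A minimal NumPy sketch of such random patch dropping follows; the function name and the 75% default ratio are illustrative assumptions, not taken from FLAP's code:

```python
import numpy as np

def random_mask_patches(patches, mask_ratio=0.75, rng=None):
    """Keep a random subset of patch embeddings, dropping the rest.

    patches: (num_patches, dim) array, e.g. audio spectrogram patch
    embeddings. Returns the kept patches and their original indices,
    so dropped positions can be reconstructed later if needed.
    """
    rng = np.random.default_rng(rng)
    n = patches.shape[0]
    # Number of tokens that survive masking (at least one)
    n_keep = max(1, int(round(n * (1.0 - mask_ratio))))
    kept_idx = np.sort(rng.choice(n, size=n_keep, replace=False))
    return patches[kept_idx], kept_idx
```

Because the encoder then processes only the kept subset, its cost shrinks roughly in proportion to `1 - mask_ratio`, which is what makes this kind of pretraining fast.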
Author:
Xu, Hu, Xie, Saining, Tan, Xiaoqing Ellen, Huang, Po-Yao, Howes, Russell, Sharma, Vasu, Li, Shang-Wen, Ghosh, Gargi, Zettlemoyer, Luke, Feichtenhofer, Christoph
Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced research and applications in computer vision, fueling modern recognition systems and generative models. We believe that the main ingredient to the success of CLIP is its…
External link:
http://arxiv.org/abs/2309.16671
Author:
Tseng, Yuan, Berry, Layne, Chen, Yi-Ting, Chiu, I-Hsiang, Lin, Hsuan-Hao, Liu, Max, Peng, Puyuan, Shih, Yi-Jen, Wang, Hung-Yu, Wu, Haibin, Huang, Po-Yao, Lai, Chun-Mao, Li, Shang-Wen, Harwath, David, Tsao, Yu, Watanabe, Shinji, Mohamed, Abdelrahman, Feng, Chi-Luen, Lee, Hung-yi
Audio-visual representation learning aims to develop systems with human-like perception by utilizing the correlation between auditory and visual information. However, current models often focus on a limited set of tasks, and generalization abilities of l…
External link:
http://arxiv.org/abs/2309.10787
Author:
Ryali, Chaitanya, Hu, Yuan-Ting, Bolya, Daniel, Wei, Chen, Fan, Haoqi, Huang, Po-Yao, Aggarwal, Vaibhav, Chowdhury, Arkabandhu, Poursaeed, Omid, Hoffman, Judy, Malik, Jitendra, Li, Yanghao, Feichtenhofer, Christoph
Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance. While these components lead to effective accuracies and attractive FLOP counts, the added complexity actual…
External link:
http://arxiv.org/abs/2306.00989