Showing 1 - 6 of 6
for search: '"Berry, Layne"'
Author:
Wang, Hsuan-Fu, Shih, Yi-Jen, Chang, Heng-Jui, Berry, Layne, Peng, Puyuan, Lee, Hung-yi, Wang, Hsin-Min, Harwath, David
The recently proposed visually grounded speech model SpeechCLIP is an innovative framework that bridges speech and text through images via CLIP without relying on text transcription. On this basis, this paper introduces two extensions to SpeechCLIP.
External link:
http://arxiv.org/abs/2402.06959
Author:
Fang, Hung-Chieh, Ye, Nai-Xuan, Shih, Yi-Jen, Peng, Puyuan, Wang, Hsuan-Fu, Berry, Layne, Lee, Hung-yi, Harwath, David
Recent advances in self-supervised speech models have shown significant improvement in many downstream tasks. However, these models have predominantly centered on frame-level training objectives, which can fall short in spoken language understanding tasks …
External link:
http://arxiv.org/abs/2402.05819
Author:
Tseng, Yuan, Berry, Layne, Chen, Yi-Ting, Chiu, I-Hsiang, Lin, Hsuan-Hao, Liu, Max, Peng, Puyuan, Shih, Yi-Jen, Wang, Hung-Yu, Wu, Haibin, Huang, Po-Yao, Lai, Chun-Mao, Li, Shang-Wen, Harwath, David, Tsao, Yu, Watanabe, Shinji, Mohamed, Abdelrahman, Feng, Chi-Luen, Lee, Hung-yi
Audio-visual representation learning aims to develop systems with human-like perception by utilizing correlation between auditory and visual information. However, current models often focus on a limited set of tasks, and the generalization abilities of …
External link:
http://arxiv.org/abs/2309.10787
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval. For non-English image-speech retrieval, we outperform the current state of the art by a wide margin …
External link:
http://arxiv.org/abs/2211.01180
Recent visuolinguistic pre-trained models show promising progress on various end tasks such as image retrieval and video captioning. Yet, they fail miserably on the recently proposed Winoground dataset, which challenges models to match paired images …
External link:
http://arxiv.org/abs/2211.00768
Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed speech data is costly. Therefore, we propose SpeechCLIP, a novel framework bridging speech and text through images to enhance …
External link:
http://arxiv.org/abs/2210.00705