Showing 1 - 10 of 33 results for search: '"David Harwath"'
Published in:
2023 11th International IEEE/EMBS Conference on Neural Engineering (NER).
Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed speech data is costly. Therefore, we propose SpeechCLIP, a novel framework bridging speech and text through images to enhance …
External link:
https://explore.openaire.eu/search/publication?articleId=doi_dedup___::85e3f6e7297423e0345649c252d5bac1
http://arxiv.org/abs/2210.00705
Author:
Tyler Miller, David Harwath
Published in:
Interspeech 2022.
Author:
David Xu, David Harwath
Published in:
ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Author:
Andrew Rouditchenko, Yung-Sung Chuang, Nina Shvetsova, Samuel Thomas, Rogerio Feris, Brian Kingsbury, Leonid Karlinsky, David Harwath, Hilde Kuehne, James Glass
Multilingual text-video retrieval methods have improved significantly in recent years, but the performance for other languages lags behind English. We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval …
External link:
https://explore.openaire.eu/search/publication?articleId=doi_dedup___::14e9221132f9b7c09a4f9a64bf4baa19
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval. For non-English image-speech retrieval, we outperform the current state-of-the-art performance by a wide margin …
External link:
https://explore.openaire.eu/search/publication?articleId=doi_dedup___::633116477493a86a9340efedaeaa26e3
Author:
Reem Gody, David Harwath
Self-supervised learning (SSL) has been able to leverage unlabeled data to boost the performance of automatic speech recognition (ASR) models when we have access to only a small amount of transcribed speech data. However, this raises the question of …
External link:
https://explore.openaire.eu/search/publication?articleId=doi_dedup___::f2a47a66228edbe8884ddd7b7c1d24d5
Author:
SouYoung Jin, James Glass, Alexander H. Liu, Mathew Monfort, Aude Oliva, David Harwath, Rogerio Feris
Published in:
CVPR
When people observe events, they are able to abstract key information and build concise summaries of what is happening. These summaries include contextual and semantic information describing the important high-level details (what, where, who, and how) …
Author:
Rameswar Panda, Andrew Rouditchenko, Hilde Kuehne, Angie Boggust, James Glass, Rogerio Feris, David Harwath, Brian Chen, Brian Kingsbury, Samuel Thomas, Michael Picheny
In this paper, we explore self-supervised audio-visual models that learn from instructional videos. Prior work has shown that these models can relate spoken words and sounds to visual content after training on a large-scale dataset of videos, but the …
External link:
https://explore.openaire.eu/search/publication?articleId=doi_dedup___::4e8574a412328c0178369baec3688670
Author:
Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne
Multi-modal learning from video data has seen increased attention recently, as it allows training semantically meaningful embeddings without human annotation, enabling tasks like zero-shot retrieval and classification. In this work, we present a multi- …
External link:
https://explore.openaire.eu/search/publication?articleId=doi_dedup___::fba61791fda0c6282401fe03b281bcb5