Showing 1 - 10 of 73 for search: '"Chang, Heng-Jui"'
Author:
Chang, Heng-Jui
Despite success across various tasks, self-supervised speech models face significant challenges in enhancing content-related performance with unlabeled data, requiring substantial computational resources. Meanwhile, learning from clustered discrete …
Author:
Yang, Shu-wen, Chang, Heng-Jui, Huang, Zili, Liu, Andy T., Lai, Cheng-I, Wu, Haibin, Shi, Jiatong, Chang, Xuankai, Tsai, Hsiang-Sheng, Huang, Wen-Chin, Feng, Tzu-hsun, Chi, Po-Han, Lin, Yist Y., Chuang, Yung-Sung, Huang, Tzu-Hsien, Tseng, Wei-Cheng, Lakhotia, Kushal, Li, Shang-Wen, Mohamed, Abdelrahman, Watanabe, Shinji, Lee, Hung-yi
The foundation model paradigm leverages a shared foundation model to achieve state-of-the-art (SOTA) performance for various tasks, requiring minimal downstream-specific modeling and data annotation. This approach has proven crucial in the field of …
External link:
http://arxiv.org/abs/2404.09385
Author:
Wang, Hsuan-Fu, Shih, Yi-Jen, Chang, Heng-Jui, Berry, Layne, Peng, Puyuan, Lee, Hung-yi, Wang, Hsin-Min, Harwath, David
The recently proposed visually grounded speech model SpeechCLIP is an innovative framework that bridges speech and text through images via CLIP without relying on text transcription. On this basis, this paper introduces two extensions to SpeechCLIP.
External link:
http://arxiv.org/abs/2402.06959
Author:
Chang, Heng-Jui, Glass, James
This paper introduces Robust Spin (R-Spin), a data-efficient domain-specific self-supervision method for speaker- and noise-invariant speech representations by learning discrete acoustic units with speaker-invariant clustering (Spin). R-Spin resolves …
External link:
http://arxiv.org/abs/2311.09117
Large-scale self-supervised pre-trained speech encoders outperform conventional approaches in speech recognition and translation tasks. Due to the high cost of developing these large models, building new encoders for new tasks and deploying them to …
External link:
http://arxiv.org/abs/2309.07707
Self-supervised speech representation models have succeeded in various tasks, but improving them for content-related problems using unlabeled data is challenging. We propose speaker-invariant clustering (Spin), a novel self-supervised learning method …
External link:
http://arxiv.org/abs/2305.11072
In this paper, we introduce self-distillation and online clustering for self-supervised speech representation learning (DinoSR), which combines masked language modeling, self-distillation, and online clustering. We show that these concepts complement …
External link:
http://arxiv.org/abs/2305.10005
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval. For non-English image-speech retrieval, we outperform the current state-of-the-art performance by a wide margin …
External link:
http://arxiv.org/abs/2211.01180
Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed speech data is costly. Therefore, we propose SpeechCLIP, a novel framework bridging speech and text through images to …
External link:
http://arxiv.org/abs/2210.00705
Author:
Tsai, Hsiang-Sheng, Chang, Heng-Jui, Huang, Wen-Chin, Huang, Zili, Lakhotia, Kushal, Yang, Shu-wen, Dong, Shuyan, Liu, Andy T., Lai, Cheng-I Jeff, Shi, Jiatong, Chang, Xuankai, Hall, Phil, Chen, Hsuan-Jui, Li, Shang-Wen, Watanabe, Shinji, Mohamed, Abdelrahman, Lee, Hung-yi
Transfer learning has proven to be crucial in advancing the state of speech and natural language processing research in recent years. In speech, a model pre-trained by self-supervised learning transfers remarkably well on multiple tasks. However, …
External link:
http://arxiv.org/abs/2203.06849