Transcription and translation of videos using fine-tuned XLSR Wav2Vec2 on custom dataset and mBART

Authors:	Tathe, Aniket; Kamble, Anand; Kumbharkar, Suyash; Bhandare, Atharva; Mitra, Anirban C.
Publication year:	2024
Subject:	Computer Science - Computation and Language; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Machine Learning; Computer Science - Sound; Electrical Engineering and Systems Science - Audio and Speech Processing
Document type:	Working Paper
Description:	This research addresses the challenge of training an automatic speech recognition (ASR) model for personalized voices with minimal data. Using just 14 minutes of custom audio from a YouTube video, we employ Retrieval-Based Voice Conversion (RVC) to create a custom Common Voice 16.0 corpus. A Cross-lingual Self-supervised Representations (XLSR) Wav2Vec2 model is then fine-tuned on this dataset. The developed web-based GUI efficiently transcribes and translates input Hindi videos. By integrating XLSR Wav2Vec2 and mBART, the system aligns the translated text with the video timeline, delivering an accessible solution for multilingual video transcription and translation for personalized voices.
Database:	arXiv
External link:	http://arxiv.org/abs/2403.00212