Improving Audio Spectrogram Transformers for Sound Event Detection Through Multi-Stage Training

Author:	Schmid, Florian; Primus, Paul; Morocutti, Tobias; Greif, Jonathan; Widmer, Gerhard
Year of publication:	2024
Subject:	Electrical Engineering and Systems Science - Audio and Speech Processing; Computer Science - Sound
Document type:	Working Paper
Description:	This technical report describes the CP-JKU team's submission for Task 4, "Sound Event Detection with Heterogeneous Training Datasets and Potentially Missing Labels", of the DCASE 24 Challenge. We fine-tune three large Audio Spectrogram Transformers, PaSST, BEATs, and ATST, on the joint DESED and MAESTRO datasets in a two-stage training procedure. The first stage closely matches the baseline system setup and trains a CRNN model while keeping the large pre-trained transformer model frozen. In the second stage, both CRNN and transformer are fine-tuned using heavily weighted self-supervised losses. After the second stage, we compute strong pseudo-labels for all audio clips in the training set using an ensemble of all three fine-tuned transformers. Then, in a second iteration, we repeat the two-stage training process and include a distillation loss based on the pseudo-labels, boosting single-model performance substantially. Additionally, we pre-train PaSST and ATST on the subset of AudioSet that comes with strong temporal labels, before fine-tuning them on the Task 4 datasets. Comment: Technical Report describing our system for DCASE2024 Challenge Task 4: https://dcase.community/challenge2024/task-sound-event-detection-with-heterogeneous-training-dataset-and-potentially-missing-labels-results Code: https://github.com/CPJKU/cpjku_dcase24. arXiv admin note: text overlap with arXiv:2407.12997
Database:	arXiv
External link:	http://arxiv.org/abs/2408.00791
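
The description mentions ensemble pseudo-labels and a distillation loss without giving their exact form. The following is a minimal sketch of one plausible reading, not the authors' implementation: it assumes the ensemble averages frame-wise class probabilities across the three models and that the distillation loss is a binary cross-entropy against those soft targets. The function names and the toy numbers are hypothetical and exist only for illustration.

```python
import math

def ensemble_pseudo_labels(model_probs):
    """Average frame-wise class probabilities across models (assumed ensembling rule).

    model_probs: one [frames][classes] nested list of probabilities per model.
    """
    n_models = len(model_probs)
    n_frames = len(model_probs[0])
    n_classes = len(model_probs[0][0])
    return [
        [sum(m[t][c] for m in model_probs) / n_models for c in range(n_classes)]
        for t in range(n_frames)
    ]

def distillation_loss(student_probs, pseudo_labels, eps=1e-7):
    """Mean binary cross-entropy between student outputs and soft pseudo-labels
    (assumed loss form; the record does not specify it)."""
    total, count = 0.0, 0
    for s_row, p_row in zip(student_probs, pseudo_labels):
        for s, p in zip(s_row, p_row):
            s = min(max(s, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(p * math.log(s) + (1.0 - p) * math.log(1.0 - s))
            count += 1
    return total / count

# Toy predictions for 2 frames and 2 classes; values are made up.
passt = [[0.9, 0.1], [0.2, 0.8]]
beats = [[0.7, 0.3], [0.4, 0.6]]
atst = [[0.8, 0.2], [0.3, 0.7]]

pseudo = ensemble_pseudo_labels([passt, beats, atst])
uninformed_student = [[0.5, 0.5], [0.5, 0.5]]
loss = distillation_loss(uninformed_student, pseudo)
```

In a second training iteration, a term like this would be added to the supervised and self-supervised losses, so that a single model is pulled toward the ensemble's frame-level predictions.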