End-To-End Melody Note Transcription Based on a Beat-Synchronous Attention Mechanism

Authors: Kazuyoshi Yoshii, Masataka Goto, Eita Nakamura, Ryo Nishikimi
Year of publication: 2019
Subject:
Source: WASPAA
DOI: 10.1109/waspaa.2019.8937207
Description: This paper describes an end-to-end audio-to-symbolic singing transcription method for mixtures of vocal and accompaniment parts. Given audio signals with non-aligned melody scores, we aim to train a recurrent neural network that takes as input a magnitude spectrogram and outputs a sequence of melody notes represented by pairs of pitches and note values (durations). A promising approach to such sequence-to-sequence learning (joint input-to-output alignment and mapping) is to use an encoder-decoder model with an attention mechanism. This approach, however, cannot be used straightforwardly for singing transcription because a note-level decoder fails to estimate note values from latent representations obtained by a frame-level encoder that is good at extracting instantaneous features, but poor at extracting temporal features. To solve this problem, we focus on tatums instead of notes as output units and propose a tatum-level decoder that sequentially outputs tatum-level score segments represented by note pitches, note onset flags, and beat and downbeat flags. We then propose a beat-synchronous attention mechanism constrained in order to monotonically align tatum-level scores with input audio signals with a steady increment. The experimental results showed that the proposed method can be trained successfully from non-aligned data thanks to the beat-synchronous attention mechanism.
Database: OpenAIRE
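
Note: The tatum-level output representation mentioned in the description (note pitches, note onset flags, and beat and downbeat flags) can be illustrated with a minimal sketch. This is not the authors' code. It assumes MIDI pitch numbers, a hypothetical REST marker of -1 for silence, note lengths given in tatums, four tatums per beat, four beats per bar, and a score that starts on a downbeat; none of these details are stated in the record. The names TatumLabel, notes_to_tatums, and tatums_to_notes are made up for this illustration.

```python
from typing import List, NamedTuple, Tuple

REST = -1  # hypothetical marker for silence (an assumption, not from the paper)


class TatumLabel(NamedTuple):
    pitch: int      # MIDI pitch sounding at this tatum, or REST
    onset: bool     # True if a note starts at this tatum
    beat: bool      # True if this tatum falls on a beat
    downbeat: bool  # True if this tatum falls on the first beat of a bar


def notes_to_tatums(
    notes: List[Tuple[int, int]],
    tatums_per_beat: int = 4,   # assumed: sixteenth-note tatums
    beats_per_bar: int = 4,     # assumed: 4/4 meter
) -> List[TatumLabel]:
    """Expand (pitch, length_in_tatums) notes into one label per tatum."""
    labels: List[TatumLabel] = []
    tatums_per_bar = tatums_per_beat * beats_per_bar
    for pitch, length in notes:
        for k in range(length):
            idx = len(labels)  # position of this tatum from the score start
            labels.append(
                TatumLabel(
                    pitch=pitch,
                    onset=(k == 0 and pitch != REST),
                    beat=(idx % tatums_per_beat == 0),
                    downbeat=(idx % tatums_per_bar == 0),
                )
            )
    return labels


def tatums_to_notes(labels: List[TatumLabel]) -> List[Tuple[int, int]]:
    """Collapse tatum labels back into (pitch, length_in_tatums) notes."""
    notes: List[List[int]] = []
    for lab in labels:
        # A new note starts at an onset flag or whenever the pitch changes.
        if lab.onset or not notes or notes[-1][0] != lab.pitch:
            notes.append([lab.pitch, 1])
        else:
            notes[-1][1] += 1
    return [(pitch, length) for pitch, length in notes]


if __name__ == "__main__":
    melody = [(60, 4), (62, 2), (REST, 2), (64, 8)]
    for i, label in enumerate(notes_to_tatums(melody)):
        print(i, label)
```

Under these assumptions, note values are no longer predicted directly: a duration is recovered by counting the tatums between consecutive onset flags, which is why the conversion back to notes needs only the pitch and onset fields.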