Popis: |
Real-world video scenes are usually very complex because they are mixtures of many different audio-visual objects. Humans with normal hearing can easily locate, identify, and distinguish sound sources that are heard simultaneously. For machines, however, this is an extremely difficult task, and building machine listening algorithms that automatically separate sound sources under difficult mixing conditions remains an open challenge. In this paper, we consider a visually guided audio source separation approach for separating the sounds of different instruments in videos, in which detected visual objects are used to assist the separation process. In particular, we investigate the use of different object detectors for this task. In addition, as an empirical study, we analyze the effect of the training datasets on separation performance. Finally, experimental results obtained on the MUSIC benchmark dataset confirm the advantages of the new object detector investigated in the paper. |