Popis: |
Locating a speaker in space is a skill that plays an essential role in conducting smooth and natural social interactions. Equipping robots with this ability could lead to more fluid human-robot interaction, for example by facilitating voice recognition in noisy environments. Most recently proposed sound-localisation systems rely on model-based approaches. However, their performance depends on carefully chosen parameters, especially in the binaural and noisy settings typical of humanoid setups. The need for fine-tuning and re-adaptation in new environments represents a considerable obstacle to the use and portability of such systems in real human-robot interaction scenarios. To overcome these limitations, we propose to rely on data-driven approaches (i.e., deep learning) and to exploit multi-sensory mechanisms that leverage the direct experience sensed by the robot during an interaction. Taking inspiration from how humans use vision to calibrate their auditory space representation through experience, we enable the robot to learn to localise a speaker in a self-supervised way, using its own visual perception of the speaker to supervise its auditory estimates. Our results show that this approach is suitable for learning to localise speakers in the challenging environments typical of human-robot collaboration.
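To make the self-supervision mechanism concrete, here is a minimal sketch, assuming PyTorch. The network shape, the 36-bin azimuth discretisation, and the names `BinauralNet` and `self_supervised_step` are illustrative assumptions, not the authors' implementation: it shows how azimuth pseudo-labels derived from the robot's visual detection of the speaker could supervise a binaural audio network.

```python
# Minimal sketch (assumptions: PyTorch; BinauralNet, the azimuth binning,
# and all hyperparameters are hypothetical, not taken from the paper).
import torch
import torch.nn as nn

class BinauralNet(nn.Module):
    """Maps binaural (left/right) spectrogram features to a discrete azimuth class."""
    def __init__(self, n_azimuth_bins: int = 36):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=3, padding=1),  # 2 channels: left/right ear
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
            nn.Flatten(),
            nn.Linear(16 * 8 * 8, n_azimuth_bins),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def self_supervised_step(model, optimizer, binaural_batch, visual_azimuth_bins):
    """One training step: the visual system's azimuth estimate of the speaker
    serves as the pseudo-label for the audio network (no manual annotation)."""
    optimizer.zero_grad()
    logits = model(binaural_batch)
    loss = nn.functional.cross_entropy(logits, visual_azimuth_bins)
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random tensors standing in for real robot data.
model = BinauralNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
audio = torch.randn(4, 2, 64, 100)    # batch of left/right spectrograms
labels = torch.randint(0, 36, (4,))   # azimuth bins from visual speaker detection
print(self_supervised_step(model, opt, audio, labels))
```

The key point the sketch illustrates is that whenever the robot both sees and hears the speaker, the visual estimate can act as the supervisory signal for the auditory pathway, so the robot calibrates its auditory space representation from its own experience.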