Description: |
Massive amounts of audio-visual content are shared on public platforms every day. This content is created for many purposes, from entertainment and teaching to extremist propaganda. Civil security actors need to monitor these platforms to detect and neutralize security threats. Generating actionable knowledge from multimedia content requires extracting many kinds of information, from linguistic data to sounds and background noises. Information extraction demands audio-visual annotations, which are costly and time-consuming to produce manually, and this hinders the analysis of such an overwhelming amount of data. This work, performed in the context of the EU Horizon 2020 Project AIDA, addresses the challenge of building a robust sound detector focused on events relevant to the counter-terrorism domain. Our classification framework combines PLP (perceptual linear prediction) features with a convolutional architecture to train a scalable model on a large number of events, which is later fine-tuned on the subset of interest. We also investigated the fusion of different corpora, which revealed the difficulties this task poses. With our framework, we attained an average F1-score of 0.53 on the target set of events. Notably, a general-purpose class was introduced during the fine-tuning phase, which allowed the model to generalize to 'unseen' events and highlights the importance of a robust fine-tuning step. |