Description: |
In this thesis, we developed and assessed a novel, robust, and unsupervised framework for semantic inference from composite audio signals. We focused on the problem of detecting audio scenes and grouping them into meaningful clusters. Our approach addresses all major steps in the general process of composite audio analysis, from low-level signal processing (feature extraction), via mid-level content representation (audio element extraction and weighting), to high-level semantic inference (audio scene detection and clustering). We showed experimentally that our proposed content discovery scheme, which involves mid-level semantic descriptors as an intermediate inference result, can lead to increased robustness compared to the classical content-based audio indexing approach, in which semantics are inferred directly from the features. To the best of our knowledge, this is the first proposal for an entirely unsupervised audio content discovery system aimed at high-level semantic inference.

The first major algorithmic contribution of the thesis is an unsupervised approach to decomposing an audio stream into (key) audio elements, based on a set of extracted audio signal features. Just as speech recognition transcribes a speech signal into text words, our approach “transcribes” a composite audio signal into audio “words”, where each word corresponds to a short temporal segment with coherent signal properties (e.g. music, speech, noise, or any combination of these). We refer to these audio words as audio elements. To extract them, we deployed an iterative spectral clustering method with context-dependent scaling factors, in which elementary audio segments with similar features are grouped into clusters; all segments belonging to the same cluster are then said to represent the same audio element.

An audio signal can now be seen as a concatenation of audio segments corresponding to different audio elements, which allows us to develop an approach similar to those known from the text document segmentation field to divide the signal into meaningful longer segments. We refer to these segments as audio scenes. To develop such an approach, we computed weights indicating the potential of each obtained audio element to help detect an audio scene boundary. For these weights, we again adopted concepts from text information retrieval, such as term frequency (TF) and inverse document frequency (IDF), and introduced a number of their equivalents for the audio segmentation context.

As the second major algorithmic contribution of the thesis, we presented a novel approach to audio scene segmentation and clustering. We first proposed a semantic affinity measure to determine whether two audio segments are likely to belong to the same audio scene. This measure considers the audio elements contained in the analyzed segments, their importance weights, and their co-occurrence statistics. The presence of an audio scene boundary at a given time stamp is then investigated by jointly considering the semantic affinity values computed for a representative number of segment pairs surrounding that time stamp. Once the audio scenes are detected, a scheme based on the co-clustering concept is deployed to exploit the grouping tendency among audio elements when searching for optimal audio scene clusters; within this co-clustering process, a method based on the Bayesian information criterion (BIC) selects the numbers of clusters. Simplified sketches of these algorithmic steps are given below.
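The following sketch illustrates how the audio-element extraction step could be realized. It assumes a self-tuning-style affinity in which each segment's scaling factor is its distance to a k-th nearest neighbour, and it performs a single clustering pass rather than the iterative procedure of the thesis; the feature matrix, the neighbourhood size k, and the number of clusters are illustrative placeholders.

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs

def context_dependent_affinity(features, k=7):
    """A[i, j] = exp(-d(i, j)^2 / (sigma_i * sigma_j)), where sigma_i is the
    distance from segment i to its k-th nearest neighbour, serving as the
    context-dependent scaling factor."""
    dist = cdist(features, features)            # pairwise Euclidean distances
    sigma = np.sort(dist, axis=1)[:, k]         # k-th neighbour distance per segment
    affinity = np.exp(-dist ** 2 / np.outer(sigma, sigma))
    np.fill_diagonal(affinity, 0.0)
    return affinity

# Toy input: one feature vector per elementary audio segment.
features, _ = make_blobs(n_samples=200, centers=4, n_features=16, random_state=0)

affinity = context_dependent_affinity(features)
labels = SpectralClustering(n_clusters=4, affinity="precomputed",
                            random_state=0).fit_predict(affinity)
# Segments sharing a label are taken to represent the same audio element.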
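The importance weighting can be sketched with plain TF/IDF equivalents. Here, each "document" is an audio stream represented as a sequence of audio-element labels; the exact weighting variants introduced in the thesis are not reproduced, only the underlying idea.

import math
from collections import Counter

def element_weights(documents):
    """documents: list of audio-element label sequences, one per audio stream."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))               # streams in which an element occurs
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({e: (tf[e] / len(doc))                 # TF equivalent
                           * math.log(n_docs / doc_freq[e])   # IDF equivalent
                        for e in tf})
    return weights

docs = [["speech", "music", "speech", "noise"],
        ["music", "music", "applause"],
        ["speech", "speech", "speech", "music"]]
# Elements occurring in every stream (here "music") receive zero weight.
print(element_weights(docs)[0])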
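Next, a schematic illustration of the semantic affinity measure and its use for boundary detection. The multiplicative combination of importance weights and co-occurrence statistics, and the windowed aggregation around a candidate time stamp, are plausible placeholders rather than the exact formulation of the thesis.

import numpy as np

def semantic_affinity(seg_a, seg_b, weight, cooc):
    """seg_a, seg_b: lists of element ids; weight: importance per element;
    cooc[e, f]: co-occurrence statistic of elements e and f."""
    score = sum(weight[e] * weight[f] * cooc[e, f]
                for e in seg_a for f in seg_b)
    return score / (len(seg_a) * len(seg_b))

def boundary_score(segments, t, window, weight, cooc):
    """Aggregate affinities of segment pairs straddling position t;
    low cross-affinity suggests an audio scene boundary."""
    left = segments[max(0, t - window):t]
    right = segments[t:t + window]
    affs = [semantic_affinity(a, b, weight, cooc) for a in left for b in right]
    return -float(np.mean(affs))    # higher score = more likely a boundary

cooc = np.array([[1.0, 0.2], [0.2, 1.0]])   # toy co-occurrence of elements 0 and 1
weight = {0: 0.8, 1: 0.5}
segments = [[0, 0], [0], [1, 1], [1]]       # element ids per elementary segment
print(boundary_score(segments, 2, 2, weight, cooc))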
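Finally, a schematic sketch of selecting the number of co-clusters with a BIC-style criterion. X[i, j] holds the weight of audio element j in candidate scene i; scikit-learn's SpectralCoclustering ties the numbers of row and column clusters together, unlike the method in the thesis, and the Gaussian block-mean BIC below is only a simple stand-in for the criterion actually used.

import numpy as np
from sklearn.cluster import SpectralCoclustering

def bic_score(X, row_labels, col_labels, n_clusters):
    """Gaussian BIC with one mean per co-cluster block plus a shared variance."""
    resid = X.astype(float).copy()
    for r in range(n_clusters):
        for c in range(n_clusters):
            idx = np.ix_(row_labels == r, col_labels == c)
            if resid[idx].size:
                resid[idx] -= resid[idx].mean()   # remove the block mean
    n = X.size
    sigma2 = max(float(np.mean(resid ** 2)), 1e-12)        # ML variance estimate
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
    n_params = n_clusters ** 2 + 1                         # block means + variance
    return log_lik - 0.5 * n_params * np.log(n)

rng = np.random.default_rng(1)
X = np.abs(rng.normal(size=(30, 12))) + 0.1    # toy scene-by-element weight matrix

scores = {}
for k in range(2, 7):
    model = SpectralCoclustering(n_clusters=k, random_state=0).fit(X)
    scores[k] = bic_score(X, model.row_labels_, model.column_labels_, k)
print("selected number of co-clusters:", max(scores, key=scores.get))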
Experimental evaluations on a large and representative audio data set have shown that the proposed approach achieves encouraging results and outperforms existing related approaches. The results show a relatively high purity of the extracted audio elements; their number, the types of sound they represent, and the importance weights assigned to them were shown to largely correspond to the judgments of our test user panel. Moreover, for audio scene segmentation, we obtained a 70% recall of audio scene boundaries with 80% precision, measured against ground-truth annotations produced by a panel of human annotators. Our co-clustering-based approach achieved better performance than traditional one-directional clustering, in terms of both clustering accuracy and cluster number estimation.

We completed the thesis by envisioning a possible expansion of the proposed approach towards an application scope broader than the one considered here. We first considered applications where domain knowledge is available and investigated how our unsupervised approach can be combined with a supervised one, so as to benefit from that knowledge and improve the content discovery performance for the given domain. We then performed preliminary experiments to extrapolate the applicability of the proposed approach from a single-document context to a collection of (long) audio documents. This involved a shift from the concept of document-specific audio elements to an anchor space representing a large collection of audio documents.