Multimodal Recognition of Visual Concepts using Histograms of Textual Concepts and Selective Weighted Late Fusion Scheme
Author: Yu Zhang, Emmanuel Dellandréa, Ningning Liu, Liming Chen, Bruno Tellez, Stéphane Bres, Charles-Edmond Bichot, Chao Zhu
Contributors: Extraction de Caractéristiques et Identification (imagine), Laboratoire d'InfoRmatique en Image et Systèmes d'information (LIRIS), Institut National des Sciences Appliquées de Lyon (INSA Lyon), Université de Lyon-Institut National des Sciences Appliquées (INSA)-Centre National de la Recherche Scientifique (CNRS)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-École Centrale de Lyon (ECL), Université de Lyon-Université Lumière - Lyon 2 (UL2)
Language: English
Year of publication: 2013
Subject: textual feature, visual feature, image classification, automatic image annotation, late fusion, histogram, ImageCLEF photo annotation, pattern recognition, computer vision, artificial intelligence, signal processing, [INFO] Computer Science [cs]
Source: Computer Vision and Image Understanding, Elsevier, 2013, vol. 117, no. 5, pp. 493-512. ⟨10.1016/j.cviu.2012.10.009⟩
ISSN: 1077-3142, 1090-235X
DOI: 10.1016/j.cviu.2012.10.009
Description: International audience; The text associated with images provides valuable semantic meaning about image content that can hardly be described by low-level visual features. In this paper, we propose a novel multimodal approach to automatically predict the visual concepts of images through an effective fusion of textual features along with visual ones. In contrast to the classical Bag-of-Words approach, which simply relies on term frequencies, we propose a novel textual descriptor, namely the Histogram of Textual Concepts (HTC), which accounts for the relatedness of semantic concepts when accumulating the contributions of words from the image caption toward a dictionary. In addition to the popular SIFT-like features, we also evaluate a set of mid-level visual features, aiming at characterizing the harmony, dynamism and aesthetic quality of visual content, in relationship with affective concepts. Finally, a novel selective weighted late fusion (SWLF) scheme is proposed to automatically select and weight the scores from the best features according to the concept to be classified. This scheme proves particularly useful for the image annotation task in a multi-label scenario. Extensive experiments were carried out on the MIR FLICKR image collection within the ImageCLEF 2011 photo annotation challenge. Our best model, a late fusion of textual and visual features, achieved a MiAP (Mean interpolated Average Precision) of 43.69% and ranked 2nd out of 79 runs. We also provide a comprehensive analysis of the experimental results and give some insights for future improvements.
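The core idea of the HTC descriptor described above can be sketched as follows: instead of counting exact term matches as in Bag-of-Words, every caption word contributes to every dictionary concept in proportion to a semantic relatedness score. This is a minimal illustrative sketch, not the paper's implementation; the `toy_relatedness` function below (substring containment) is a placeholder assumption standing in for a proper semantic similarity measure.

```python
def htc(caption_words, dictionary, relatedness):
    """Histogram of Textual Concepts sketch: one bin per dictionary
    concept, accumulating the relatedness of every caption word
    toward that concept (rather than exact term frequencies)."""
    hist = [0.0] * len(dictionary)
    for i, concept in enumerate(dictionary):
        for word in caption_words:
            hist[i] += relatedness(word, concept)
    # L1-normalize so captions of different lengths stay comparable
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist

def toy_relatedness(word, concept):
    """Purely illustrative relatedness: 1.0 for an exact match,
    0.5 when one term contains the other, 0.0 otherwise."""
    if word == concept:
        return 1.0
    if concept in word or word in concept:
        return 0.5
    return 0.0

# "sunset" and "sunny" each contribute 0.5 to the "sun" bin even
# though neither matches it exactly -- the behavior BoW misses.
hist = htc(["sunset", "beach", "sunny"], ["sun", "sea", "beach"],
           toy_relatedness)
# -> [0.5, 0.0, 0.5]
```

With a real relatedness measure (and per-concept score selection as in the SWLF scheme), the same accumulate-then-normalize structure carries over; the toy function only serves to make the example self-contained.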
Database: OpenAIRE
External link: