Inferring Context from Pixels for Multimodal Image Classification
Author: | Manan Shah, Zhen Li, Chen Sun, Ariel Fuxman, Chao Jia, Krishnamurthy Viswanathan, Aleksei Timofeev, Chun-Ta Lu |
Year: | 2019 |
Subject: | Contextual image classification; Computer science; Pattern recognition; Artificial intelligence; Computing methodologies: image processing and computer vision; Context (language use); Taxonomy (general); Phrase; Pixel; Focus (optics); Generator (mathematics); Interpretability |
Source: | CIKM |
DOI: | 10.1145/3357384.3357987 |
Description: | Image classification models take image pixels as input and predict labels in a predefined taxonomy. While contextual information (e.g. text surrounding an image) can provide valuable orthogonal signals to improve classification, the typical setting in the literature assumes the unavailability of text and thus focuses on models that rely purely on pixels. In this work, we also focus on the setting where only pixels are available in the input. However, we demonstrate that if we predict textual information from pixels, we can subsequently use the predicted text to train models that improve overall performance. We propose a framework that consists of two main components: (1) a phrase generator that maps image pixels to a contextual phrase, and (2) a multimodal model that uses textual features from the phrase generator and visual features from the image pixels to produce labels in the output taxonomy. The phrase generator is trained using web-based query-image pairs to incorporate contextual information associated with each image and has a large output space. We evaluate our framework on diverse benchmark datasets (specifically, the WebVision dataset for evaluating multi-class classification and the OpenImages dataset for evaluating multi-label classification), demonstrating performance improvements over approaches based exclusively on pixels and showcasing benefits in prediction interpretability. We additionally present results to demonstrate that our framework provides improvements in few-shot learning of minimally labeled concepts. We further demonstrate the unique benefits of the multimodal nature of our framework by utilizing intermediate image/text co-embeddings to perform baseline zero-shot learning on the ImageNet dataset. |
Database: | OpenAIRE |
External link: |
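
The abstract describes a two-component pipeline: a phrase generator that maps pixels to a contextual phrase, and a multimodal classifier that fuses the predicted text with visual features to output taxonomy labels. The sketch below is a minimal, hypothetical illustration of that data flow only; it is not the authors' code. The class and function names (PhraseGenerator, MultimodalClassifier, embed_phrase) are invented for this sketch, and the learned components are stubbed with random projections so the example runs end to end.

```python
# Minimal sketch of the two-stage pipeline from the abstract (not the authors' code):
#   pixels -> visual features -> predicted phrase -> text features -> fused -> label
# All names and the random-projection "models" below are illustrative stand-ins.

import numpy as np


class PhraseGenerator:
    """Maps visual features to a contextual phrase from a large phrase vocabulary.

    In the paper this component is trained on web-based query-image pairs;
    here it is stubbed with a fixed random projection for illustration only.
    """

    def __init__(self, phrase_vocab, feat_dim=2048, seed=0):
        rng = np.random.default_rng(seed)
        self.phrase_vocab = phrase_vocab
        self.proj = rng.standard_normal((feat_dim, len(phrase_vocab)))

    def predict_phrase(self, visual_features):
        scores = visual_features @ self.proj
        return self.phrase_vocab[int(np.argmax(scores))]


class MultimodalClassifier:
    """Fuses visual features with text features from the predicted phrase
    and scores labels in the output taxonomy."""

    def __init__(self, labels, feat_dim=2048, text_dim=300, seed=1):
        rng = np.random.default_rng(seed)
        self.labels = labels
        self.w = rng.standard_normal((feat_dim + text_dim, len(labels)))

    def predict(self, visual_features, text_features):
        fused = np.concatenate([visual_features, text_features])
        scores = fused @ self.w
        return self.labels[int(np.argmax(scores))]


def embed_phrase(phrase, dim=300):
    """Hypothetical text embedder (in practice a learned text encoder);
    a deterministic hash-seeded stub so the sketch runs end to end."""
    rng = np.random.default_rng(abs(hash(phrase)) % (2**32))
    return rng.standard_normal(dim)


if __name__ == "__main__":
    # Stand-in for visual features extracted from image pixels by a CNN backbone.
    visual_features = np.random.default_rng(42).standard_normal(2048)

    generator = PhraseGenerator(phrase_vocab=["golden retriever puppy",
                                              "vintage sports car",
                                              "mountain lake at sunset"])
    phrase = generator.predict_phrase(visual_features)          # step 1: pixels -> phrase
    text_features = embed_phrase(phrase)                        # step 2: phrase -> text features

    classifier = MultimodalClassifier(labels=["dog", "car", "landscape"])
    label = classifier.predict(visual_features, text_features)  # step 3: fuse -> label
    print(f"predicted phrase: {phrase!r} -> label: {label!r}")
```

The point of the sketch is the wiring, not the models: text is never taken from the input at inference time; it is predicted from pixels and then consumed as a second modality by the classifier, which is what distinguishes the framework from purely pixel-based approaches.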