Cops-Ref: A new Dataset and Task on Compositional Referring Expression Comprehension

Authors:	Lin Ma, Kwan-Yee K. Wong, Qi Wu, Peng Wang, Zhenfang Chen
Language:	English
Year of publication:	2020
Subject:	FOS: Computer and information sciences; Referring expression; Principle of compositionality; Computer science; Computer Vision and Pattern Recognition (cs.CV); 02 engineering and technology; Visual reasoning; 010501 environmental sciences; Referent; Semantics; 01 natural sciences; Expression (mathematics); Visualization; Comprehension; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Artificial intelligence; Natural language; Natural language processing; 0105 earth and related environmental sciences
Source:	CVPR
Description:	Referring expression comprehension (REF) aims at identifying a particular object in a scene by a natural language expression. It requires joint reasoning over the textual and visual domains to solve the problem. Some popular referring expression datasets, however, fail to provide an ideal test bed for evaluating the reasoning ability of the models, mainly because 1) their expressions typically describe only some simple distinctive properties of the object and 2) their images contain limited distracting information. To bridge the gap, we propose a new dataset for visual reasoning in the context of referring expression comprehension with two main features. First, we design a novel expression engine rendering various reasoning logics that can be flexibly combined with rich visual properties to generate expressions with varying compositionality. Second, to better exploit the full reasoning chain embodied in an expression, we propose a new test setting by adding distracting images containing objects that share similar properties with the referent, thus minimising the success rate of reasoning-free cross-domain alignment. We evaluate several state-of-the-art REF models, but find that none of them achieves promising performance. A proposed modular hard mining strategy performs the best but still leaves substantial room for improvement. We hope this new dataset and task can serve as a benchmark for deeper visual reasoning analysis and foster research on referring expression comprehension. To appear in CVPR 2020.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::28d6c9a3b69e1e485232fd1e6ff0c702 http://arxiv.org/abs/2003.00403