Scan2Cap: Context-aware Dense Captioning in RGB-D Scans

Authors: Dave Zhenyu Chen, Ali Gholami, Matthias Nießner, Angel X. Chang
Year of publication: 2021
Subject:
FOS: Computer and information sciences
FOS: Electrical engineering, electronic engineering, information engineering
Computer Science - Computer Vision and Pattern Recognition (cs.CV)
Computer Science - Machine Learning (cs.LG)
Electrical Engineering and Systems Science - Image and Video Processing (eess.IV)
Computer science
Computer vision
Artificial intelligence
Closed captioning
Object detection
Message passing
Graph (abstract data type)
Spatial relation
Context (language use)
Margin (machine learning)
Learning object
Source: CVPR
DOI: 10.1109/cvpr46437.2021.00321
Description: We introduce the task of dense captioning in 3D scans from commodity RGB-D sensors. As input, we assume a point cloud of a 3D scene; the expected output is the bounding boxes along with the descriptions for the underlying objects. To address the 3D object detection and description problems, we propose Scan2Cap, an end-to-end trained method to detect objects in the input scene and describe them in natural language. We use an attention mechanism that generates descriptive tokens while referring to the related components in the local context. To reflect object relations (i.e., relative spatial relations) in the generated captions, we use a message passing graph module to facilitate learning object relation features. Our method can effectively localize and describe 3D objects in scenes from the ScanRefer dataset, outperforming 2D baseline methods by a significant margin (27.61% CIDEr@0.5IoU improvement).
Video: https://youtu.be/AgmIpDbwTCY
Database: OpenAIRE
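
The description outlines two architectural ingredients: a message-passing graph module that enriches detected object features with relational context, and an attention mechanism that looks back at those features while generating caption tokens. The PyTorch sketch below is not the authors' implementation; the class names, layer choices (GRU cells, mean aggregation, dot-product attention), and tensor shapes are illustrative assumptions meant only to show how such components are commonly wired together.

import torch
import torch.nn as nn


class RelationalMessagePassing(nn.Module):
    # One round of message passing over detected object proposals: each object
    # feature is updated with messages aggregated from every other proposal,
    # so relational (e.g. spatial) context can flow into the features.
    # Illustrative sketch; not the Scan2Cap module itself.
    def __init__(self, dim):
        super().__init__()
        self.message_fn = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.update_fn = nn.GRUCell(dim, dim)

    def forward(self, obj_feats):  # obj_feats: (B, N, dim)
        b, n, dim = obj_feats.shape
        senders = obj_feats.unsqueeze(1).expand(b, n, n, dim)
        receivers = obj_feats.unsqueeze(2).expand(b, n, n, dim)
        messages = self.message_fn(torch.cat([receivers, senders], dim=-1))
        aggregated = messages.mean(dim=2)  # average incoming messages: (B, N, dim)
        updated = self.update_fn(aggregated.reshape(b * n, dim),
                                 obj_feats.reshape(b * n, dim))
        return updated.reshape(b, n, dim)


class AttentiveCaptionStep(nn.Module):
    # A single caption-decoding step that attends over the relation-enhanced
    # object features before predicting the next token.
    def __init__(self, dim, vocab_size):
        super().__init__()
        self.rnn = nn.GRUCell(dim, dim)
        self.attn = nn.Linear(dim, dim)
        self.out = nn.Linear(2 * dim, vocab_size)

    def forward(self, token_emb, hidden, context_feats):
        # token_emb, hidden: (B, dim); context_feats: (B, N, dim)
        hidden = self.rnn(token_emb, hidden)
        scores = torch.bmm(context_feats, self.attn(hidden).unsqueeze(-1))  # (B, N, 1)
        weights = torch.softmax(scores, dim=1)
        attended = (weights * context_feats).sum(dim=1)  # context vector: (B, dim)
        logits = self.out(torch.cat([hidden, attended], dim=-1))
        return logits, hidden


# Toy usage with random tensors standing in for detector outputs.
B, N, D, V = 2, 8, 128, 1000
proposal_feats = torch.randn(B, N, D)
relation_feats = RelationalMessagePassing(D)(proposal_feats)
step = AttentiveCaptionStep(D, V)
logits, hidden = step(torch.randn(B, D), torch.zeros(B, D), relation_feats)
print(logits.shape)  # torch.Size([2, 1000])

In the actual system, the proposal features would come from a 3D object detector run on the input point cloud, and a caption would be decoded token by token with a step of this kind, attending to the local context around each detected object.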