Showing 1 - 10 of 43 for the search: '"Vijayanarasimhan, Sudheendra"'
If you ask a human to describe an image, they might do so in a thousand different ways. Traditionally, image captioning models are trained to generate a single "best" (most like a reference) image caption. Unfortunately, doing so encourages captions…
External link:
http://arxiv.org/abs/2302.01328
Author:
Rathod, Vivek, Seybold, Bryan, Vijayanarasimhan, Sudheendra, Myers, Austin, Gu, Xiuye, Birodkar, Vighnesh, Ross, David A.
Detecting actions in untrimmed videos should not be limited to a small, closed set of classes. We present a simple, yet effective strategy for open-vocabulary temporal action detection utilizing pretrained image-text co-embeddings. Despite being trained…
External link:
http://arxiv.org/abs/2212.10596
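To make the idea in the entry above concrete, here is a minimal sketch (not the paper's actual method) of open-vocabulary temporal detection with a pretrained image-text co-embedding: per-frame visual embeddings are scored against text embeddings of arbitrary class descriptions by cosine similarity, and contiguous runs of high-scoring frames become detections. The embeddings, class names, and threshold are hypothetical inputs that any CLIP-style co-embedding model could supply.

    import numpy as np

    def l2_normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    def open_vocab_temporal_detection(frame_emb, text_emb, class_names, thresh=0.3):
        """frame_emb: (T, D) per-frame visual embeddings; text_emb: (C, D) embeddings
        of free-form class descriptions. Returns (class, start, end) frame spans."""
        sims = l2_normalize(frame_emb) @ l2_normalize(text_emb).T  # (T, C) cosine sims
        detections = []
        for c, name in enumerate(class_names):
            active = sims[:, c] >= thresh  # which frames look like this class
            t = 0
            while t < len(active):
                if active[t]:
                    start = t
                    while t < len(active) and active[t]:
                        t += 1
                    detections.append((name, start, t))  # half-open span [start, t)
                else:
                    t += 1
        return detections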
Author:
Chan, David M, Ni, Yiming, Ross, David A, Vijayanarasimhan, Sudheendra, Myers, Austin, Canny, John
Traditional automated metrics for evaluating conditional natural language generation use pairwise comparisons between a single generated text and the best-matching gold-standard ground truth text. When multiple ground truths are available, scores are…
External link:
http://arxiv.org/abs/2209.07518
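The "best-matching ground truth" scheme described in the abstract above can be illustrated with a toy sketch: the candidate is compared against every reference with some pairwise similarity and only the maximum is kept. The token-level F1 below is a deliberately simple stand-in for real pairwise metrics such as BLEU or CIDEr, not the approach advocated in the paper.

    def token_f1(candidate, reference):
        """Toy pairwise similarity (unique-token F1); stands in for BLEU/CIDEr/etc."""
        cand, ref = set(candidate.split()), set(reference.split())
        common = cand & ref
        if not common:
            return 0.0
        precision, recall = len(common) / len(cand), len(common) / len(ref)
        return 2 * precision * recall / (precision + recall)

    def best_match_score(candidate, references):
        """Traditional scheme: score against each ground truth, keep the best match."""
        return max(token_f1(candidate, r) for r in references)

    refs = ["a dog runs across the park", "a brown dog is playing outside"]
    print(best_match_score("a dog is playing in the park", refs))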
Author:
Chan, David M., Myers, Austin, Vijayanarasimhan, Sudheendra, Ross, David A., Seybold, Bryan, Canny, John F.
While there have been significant gains in the field of automated video description, the generalization performance of automated description models to novel domains remains a major barrier to using these systems in the real world. Most visual description…
External link:
http://arxiv.org/abs/2205.06253
Automatic video captioning aims to train models to generate text descriptions for all segments in a video; however, the most effective approaches require large amounts of manual annotation, which is slow and expensive. Active learning is a promising way…
External link:
http://arxiv.org/abs/2007.13913
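As a generic illustration of the active-learning idea (not the selection strategy from this paper), the sketch below ranks unlabeled video segments by the captioning model's uncertainty and spends the annotation budget on the most uncertain ones. The segment ids, uncertainty scores, and budget are made-up placeholders.

    import numpy as np

    def select_for_annotation(segment_ids, uncertainty_scores, budget=10):
        """Uncertainty sampling: pick the segments the current model is least sure
        about and send only those to (slow, expensive) human annotators."""
        order = np.argsort(-np.asarray(uncertainty_scores))  # most uncertain first
        return [segment_ids[i] for i in order[:budget]]

    # toy round with three candidate segments and a budget of one annotation
    picked = select_for_annotation(["vid1[0-5s]", "vid2[10-15s]", "vid3[3-8s]"],
                                   [0.2, 0.9, 0.5], budget=1)
    print(picked)  # ['vid2[10-15s]']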
Author:
Chao, Yu-Wei, Vijayanarasimhan, Sudheendra, Seybold, Bryan, Ross, David A., Deng, Jia, Sukthankar, Rahul
We propose TAL-Net, an improved approach to temporal action localization in video that is inspired by the Faster R-CNN object detection framework. TAL-Net addresses three key shortcomings of existing approaches: (1) we improve receptive field alignment…
External link:
http://arxiv.org/abs/1804.07667
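For readers unfamiliar with the Faster R-CNN analogy mentioned above, the sketch below generates multi-scale 1D segment proposals ("temporal anchors") over a temporal feature map, the rough counterpart of 2D box anchors in object detection. It is only a schematic of a proposal stage, not TAL-Net itself; the positions, stride, and scales are arbitrary example values.

    def temporal_anchors(num_positions, stride, scales=(1, 2, 4, 8)):
        """1D analogue of Faster R-CNN anchors: at each position of a temporal
        feature map, propose candidate segments of several durations (in frames)."""
        anchors = []
        for i in range(num_positions):
            center = (i + 0.5) * stride
            for s in scales:
                half = 0.5 * s * stride
                anchors.append((max(0.0, center - half), center + half))
        return anchors

    print(temporal_anchors(num_positions=2, stride=8))
    # e.g. (0.0, 8.0), (0.0, 12.0), ... segments centered at feature-map positions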
We consider the task of semantic robotic grasping, in which a robot picks up an object of a user-specified class using only monocular images. Inspired by the two-stream hypothesis of visual reasoning, we present a semantic grasping framework that learns…
External link:
http://arxiv.org/abs/1707.01932
Author:
Gu, Chunhui, Sun, Chen, Ross, David A., Vondrick, Carl, Pantofaru, Caroline, Li, Yeqing, Vijayanarasimhan, Sudheendra, Toderici, George, Ricco, Susanna, Sukthankar, Rahul, Schmid, Cordelia, Malik, Jitendra
This paper introduces a video dataset of spatio-temporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.5…
External link:
http://arxiv.org/abs/1705.08421
Author:
Kay, Will, Carreira, Joao, Simonyan, Karen, Zhang, Brian, Hillier, Chloe, Vijayanarasimhan, Sudheendra, Viola, Fabio, Green, Tim, Back, Trevor, Natsev, Paul, Suleyman, Mustafa, Zisserman, Andrew
We describe the DeepMind Kinetics human action video dataset. The dataset contains 400 human action classes, with at least 400 video clips for each action. Each clip lasts around 10s and is taken from a different YouTube video. The actions are human…
External link:
http://arxiv.org/abs/1705.06950
Author:
Fragkiadaki, Katerina, Huang, Jonathan, Alemi, Alex, Vijayanarasimhan, Sudheendra, Ricco, Susanna, Sukthankar, Rahul
Given a visual history, multiple future outcomes for a video scene are equally probable; in other words, the distribution of future outcomes has multiple modes. Multimodality is notoriously hard to handle by standard regressors or classifiers: the former…
External link:
http://arxiv.org/abs/1705.02082
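A tiny numeric sketch of the problem raised in the abstract above: a single L2 regressor trained against two equally likely futures collapses to their mean, while a multiple-hypothesis ("winner-takes-all") loss lets separate predictions cover separate modes. This is a generic illustration of the issue, not the method proposed in the paper.

    import numpy as np

    def mse(pred, target):
        return float(np.mean((pred - target) ** 2))

    def winner_takes_all(hypotheses, target):
        """Only the closest of K predictions is penalized, so different
        hypotheses are free to specialize on different future modes."""
        return min(mse(h, target) for h in hypotheses)

    futures = [np.array([-1.0]), np.array([1.0])]       # two equally likely outcomes
    mean_pred = np.array([0.0])                         # what an L2 regressor converges to
    hypotheses = [np.array([-1.0]), np.array([1.0])]    # two heads covering both modes
    print(np.mean([mse(mean_pred, f) for f in futures]))                # 1.0
    print(np.mean([winner_takes_all(hypotheses, f) for f in futures]))  # 0.0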