Showing 1 - 10 of 75 for search: '"Berg, Tamara L."'
Training an effective video-and-language model intuitively requires multiple frames as model inputs. However, it is unclear whether using multiple frames is beneficial to downstream tasks, and if so, whether the performance gain is worth the drastic…
External link:
http://arxiv.org/abs/2206.03428
We consider the targeted image editing problem: blending a region in a source image with a driver image that specifies the desired change. Unlike prior works, we solve this problem by learning a conditional probability distribution of the…
External link:
http://arxiv.org/abs/2205.01668
Author:
Lei, Jie, Chen, Xinlei, Zhang, Ning, Wang, Mengjiao, Bansal, Mohit, Berg, Tamara L., Yu, Licheng
Dual encoders and cross encoders have been widely used for image-text retrieval. Between the two, the dual encoder encodes the image and text independently, followed by a dot product, while the cross encoder jointly feeds image and text as the input…
External link:
http://arxiv.org/abs/2203.05465
Author:
Yu, Licheng, Chen, Jun, Sinha, Animesh, Wang, Mengjiao MJ, Chen, Hugo, Berg, Tamara L., Zhang, Ning
We introduce CommerceMM - a multimodal model capable of providing a diverse and granular understanding of commerce topics associated with the given piece of content (image, text, image+text), and having the capability to generalize to a wide range of…
External link:
http://arxiv.org/abs/2202.07247
We introduce mTVR, a large-scale multilingual video moment retrieval dataset containing 218K English and Chinese queries from 21.8K TV show video clips. The dataset is collected by extending the popular TVR dataset (in English) with paired Chinese…
External link:
http://arxiv.org/abs/2108.00061
Detecting customized moments and highlights from videos given natural language (NL) user queries is an important but under-studied topic. One of the challenges in pursuing this direction is the lack of annotated data. To address this issue, we present…
External link:
http://arxiv.org/abs/2107.09609
The canonical approach to video-and-language learning (e.g., video question answering) dictates that a neural model learn from offline-extracted dense video features from vision models and text features from language models. These feature extractors are…
External link:
http://arxiv.org/abs/2102.06183
Given a video with aligned dialogue, people can often infer what is more likely to happen next. Making such predictions requires not only a deep understanding of the rich dynamics underlying the video and dialogue, but also a significant amount of…
External link:
http://arxiv.org/abs/2010.07999
Generating multi-sentence descriptions for videos is one of the most challenging captioning tasks due to its high requirements for not only visual relevance but also discourse-based coherence across the sentences in the paragraph. Towards this goal…
External link:
http://arxiv.org/abs/2005.05402
We introduce TV show Retrieval (TVR), a new multimodal retrieval dataset. TVR requires systems to understand both videos and their associated subtitle (dialogue) texts, making it more realistic. The dataset contains 109K queries collected on 21.8K videos…
External link:
http://arxiv.org/abs/2001.09099