Showing 1 - 10 of 2,550 for search: '"Berg, A L"'
Training an effective video-and-language model intuitively requires multiple frames as model inputs. However, it is unclear whether using multiple frames is beneficial to downstream tasks, and if yes, whether the performance gain is worth the drastic…
External link:
http://arxiv.org/abs/2206.03428
We consider the targeted image editing problem: blending a region in a source image with a driver image that specifies the desired change. Differently from prior works, we solve this problem by learning a conditional probability distribution of the e…
External link:
http://arxiv.org/abs/2205.01668
Author:
Balakrishnan, Rama, Berg, Ellen L., Butler, Christopher C., Clark, Alex M., Denker, Sheryl P., Feierberg, Isabella, Harris, Jason, Ikeda, Timothy P., Jeschonek, Samantha, Makarov, Vladimir A., Southan, Christopher, Vanderwall, Dana, Winstanley, Peter
Published in:
In SLAS Discovery December 2024 29(8)
Author:
Lei, Jie, Chen, Xinlei, Zhang, Ning, Wang, Mengjiao, Bansal, Mohit, Berg, Tamara L., Yu, Licheng
Dual encoders and cross encoders have been widely used for image-text retrieval. Between the two, the dual encoder encodes the image and text independently followed by a dot product, while the cross encoder jointly feeds image and text as the input a…
External link:
http://arxiv.org/abs/2203.05465
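The dual- vs. cross-encoder distinction in the snippet above can be sketched in a few lines. This is a toy illustration only: the "encoders" below are hash-like stand-ins, not the models from the paper, and the function names are invented for this sketch.

```python
# Toy sketch of dual-encoder vs. cross-encoder scoring for image-text
# retrieval. The embedding function is a deterministic placeholder, not
# a real vision or language model.

def _toy_embed(text, dim=8):
    """Deterministic toy embedding: maps a string to a fixed-size vector."""
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch) / 100.0
    return vec

def dual_encoder_score(image_desc, caption):
    """Encode each input independently, then compare with a dot product."""
    img_vec = _toy_embed(image_desc)
    txt_vec = _toy_embed(caption)
    return sum(a * b for a, b in zip(img_vec, txt_vec))

def cross_encoder_score(image_desc, caption):
    """Feed image and text jointly into one encoder that scores the pair."""
    joint = _toy_embed(image_desc + " [SEP] " + caption)
    return sum(joint)  # toy scalar head over the joint representation

score = dual_encoder_score("a dog on grass", "a dog playing outside")
```

The practical trade-off this structure implies: dual encoders let you pre-compute and index one vector per image, so retrieval is a dot product; a cross encoder must re-run the joint model for every candidate pair, which is slower but can model interactions between the two inputs.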
Author:
Yu, Licheng, Chen, Jun, Sinha, Animesh, Wang, Mengjiao MJ, Chen, Hugo, Berg, Tamara L., Zhang, Ning
We introduce CommerceMM - a multimodal model capable of providing a diverse and granular understanding of commerce topics associated to the given piece of content (image, text, image+text), and having the capability to generalize to a wide range of t…
External link:
http://arxiv.org/abs/2202.07247
Author:
Foster, Jennifer H., Reid, Joel M., Minard, Charles, Woodfield, Sarah, Denic, Kristina Z., Isikwei, Emasenyie, Voss, Stephan D., Nelson, Marvin, Liu, Xiaowei, Berg, Stacey L., Fox, Elizabeth, Weigel, Brenda J.
Published in:
In European Journal of Cancer September 2024 209
We introduce mTVR, a large-scale multilingual video moment retrieval dataset, containing 218K English and Chinese queries from 21.8K TV show video clips. The dataset is collected by extending the popular TVR dataset (in English) with paired Chinese q…
External link:
http://arxiv.org/abs/2108.00061
Detecting customized moments and highlights from videos given natural language (NL) user queries is an important but under-studied topic. One of the challenges in pursuing this direction is the lack of annotated data. To address this issue, we presen…
External link:
http://arxiv.org/abs/2107.09609
The canonical approach to video-and-language learning (e.g., video question answering) dictates a neural model to learn from offline-extracted dense video features from vision models and text features from language models. These feature extractors ar…
External link:
http://arxiv.org/abs/2102.06183
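The "offline-extracted features" pipeline the snippet above describes can be sketched as a two-stage process: frozen extractors run once and cache their outputs, and the downstream model trains only on the cached features. This is a minimal sketch with toy placeholder extractors; none of the function names come from the paper.

```python
# Sketch of the canonical offline feature-extraction pipeline: frozen
# vision/language extractors produce features once, and the task model
# never sees raw pixels or tokens. Extractors here are toy placeholders.

def extract_video_features(frames, dim=4):
    """Stand-in for a frozen vision model: one toy vector per frame."""
    return [[float(f + j) for j in range(dim)] for f in frames]

def extract_text_features(tokens, dim=4):
    """Stand-in for a frozen language model: one toy vector per token."""
    return [[float(len(t) + j) for j in range(dim)] for t in tokens]

# Offline stage: features are computed once and cached (e.g., to disk).
cached = {
    "video": extract_video_features([0, 1, 2]),
    "text": extract_text_features("what is shown".split()),
}

def downstream_model(features):
    """Task model that consumes only the cached features."""
    flat = [x for seq in features.values() for vec in seq for x in vec]
    return sum(flat) / len(flat)

answer_score = downstream_model(cached)
```

The design consequence: because the extractors are frozen and run offline, they receive no gradient from the downstream task, which is precisely the disconnect end-to-end approaches aim to remove.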
Author:
Moore-Berg, Samantha L., Hameiri, Boaz
Published in:
In Trends in Cognitive Sciences March 2024 28(3):190-192