Showing 1 - 10 of 26 for search: '"Palaskar, Shruti"'
Author:
Palaskar, Shruti, Rudovic, Oggi, Dharur, Sameer, Pesce, Florian, Krishna, Gautam, Sivaraman, Aswin, Berkowitz, Jack, Abdelaziz, Ahmed Hussen, Adya, Saurabh, Tewfik, Ahmed
Although Large Language Models (LLMs) have shown promise for human-like conversations, they are primarily pre-trained on text data. Incorporating audio or video improves performance, but collecting large-scale multimodal data and pre-training multimodal …
External link:
http://arxiv.org/abs/2406.09617
Author:
Palaskar, Shruti, Bhagia, Akshita, Bisk, Yonatan, Metze, Florian, Black, Alan W, Marasović, Ana
Combining the visual modality with pretrained language models has been surprisingly effective for simple descriptive tasks such as image captioning. More general text generation, however, remains elusive. We take a step back and ask: How do these models …
External link:
http://arxiv.org/abs/2205.11686
Speech summarization is typically performed by using a cascade of speech recognition and text summarization models. End-to-end modeling of speech summarization models is challenging due to memory and compute constraints arising from long input audio …
External link:
http://arxiv.org/abs/2110.06263
Author:
Duarte, Amanda, Palaskar, Shruti, Ventura, Lucas, Ghadiyaram, Deepti, DeHaan, Kenneth, Metze, Florian, Torres, Jordi, Giro-i-Nieto, Xavier
One of the factors that have hindered progress in the areas of sign language recognition, translation, and production is the absence of large annotated datasets. Towards this end, we introduce How2Sign, a multimodal and multiview continuous American Sign Language …
External link:
http://arxiv.org/abs/2008.08143
Off-the-shelf pre-trained Automatic Speech Recognition (ASR) systems are an increasingly viable service for companies of any size building speech-based products. While these ASR systems are trained on large amounts of data, domain mismatch is still a …
External link:
http://arxiv.org/abs/2003.07692
In this paper, we study abstractive summarization for open-domain videos. Unlike traditional text news summarization, the goal is less to "compress" text information but rather to provide a fluent textual summary of information that has been collected …
External link:
http://arxiv.org/abs/1906.07901
End-to-end acoustic-to-word speech recognition models have recently gained popularity because they are easy to train, scale well to large amounts of training data, and do not require a lexicon. In addition, word models may also be easier to integrate …
External link:
http://arxiv.org/abs/1902.06833
An increasing number of datasets contain multiple views, such as video, sound, and automatic captions. A basic challenge in representation learning is how to leverage multiple views to learn better representations. This is further complicated by the …
External link:
http://arxiv.org/abs/1811.08890
Humans are capable of processing speech by making use of multiple sensory modalities. For example, the environment where a conversation takes place generally provides semantic and/or acoustic context that helps us to resolve ambiguities or to recall …
External link:
http://arxiv.org/abs/1811.03865
Author:
Sanabria, Ramon, Caglayan, Ozan, Palaskar, Shruti, Elliott, Desmond, Barrault, Loïc, Specia, Lucia, Metze, Florian
In this paper, we introduce How2, a multimodal collection of instructional videos with English subtitles and crowdsourced Portuguese translations. We also present integrated sequence-to-sequence baselines for machine translation, automatic speech recognition …
External link:
http://arxiv.org/abs/1811.00347