Showing 1 - 10 of 45 results for search: '"Oneata, Dan"'
In this paper, we demonstrate that attacks in the latest ASVspoof5 dataset -- a de facto standard in the field of voice authenticity and deepfake detection -- can be identified with surprising accuracy using a small subset of very simplistic features… (a minimal sketch of such a feature-based detector follows below)
External link:
http://arxiv.org/abs/2408.15775
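To make the idea concrete, here is a minimal, hypothetical sketch of a detector built on a few very simple waveform statistics and a linear classifier. The feature choices and model are illustrative assumptions, not the ones used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def simple_features(wav: np.ndarray) -> np.ndarray:
    """A handful of very simplistic waveform statistics (illustrative only)."""
    return np.array([
        wav.mean(),                           # DC offset
        wav.std(),                            # overall energy
        np.abs(np.diff(wav)).mean(),          # mean absolute first difference
        (np.diff(np.sign(wav)) != 0).mean(),  # zero-crossing rate
    ])

def fit_detector(waveforms, labels):
    """waveforms: list of 1-D arrays; labels: 0 = bona fide, 1 = spoofed."""
    X = np.stack([simple_features(w) for w in waveforms])
    clf = make_pipeline(StandardScaler(), LogisticRegression())
    clf.fit(X, labels)
    return clf
```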
Audio deepfake detection has become a pivotal task over the last couple of years, as many recent speech synthesis and voice cloning systems generate highly realistic speech samples, thus enabling their use in malicious activities. In this paper we…
External link:
http://arxiv.org/abs/2408.07414
Author:
Oneata, Dan, Kamper, Herman
Visually grounded speech models link speech to images. We extend this connection by linking images to text via an existing image captioning system, and as a result gain the ability to map speech audio directly to text. This approach can be used for… (a sketch of this pipeline follows below)
External link:
http://arxiv.org/abs/2406.07133
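A minimal sketch of the pipeline as described: map speech to the nearest image in the visually grounded embedding space, then return that image's caption. The encoder and captioner interfaces here are hypothetical placeholders, not the paper's actual components.

```python
from typing import Callable
import numpy as np

def speech_to_text(
    speech: np.ndarray,
    embed_speech: Callable[[np.ndarray], np.ndarray],  # VGS speech encoder (placeholder)
    image_embeddings: np.ndarray,                      # (N, D), rows L2-normalised
    caption_image: Callable[[int], str],               # existing captioner, by image index
) -> str:
    """Speech -> nearest image (via VGS similarity) -> caption of that image."""
    q = embed_speech(speech)
    q = q / np.linalg.norm(q)
    sims = image_embeddings @ q  # cosine similarities against all images
    return caption_image(int(np.argmax(sims)))
```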
When children learn new words, they employ constraints such as the mutual exclusivity (ME) bias: a novel word is mapped to a novel object rather than a familiar one. This bias has been studied computationally, but only in models that use discrete… (a toy illustration of the ME decision follows below)
External link:
http://arxiv.org/abs/2403.13922
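As a toy formalisation of the ME bias (our own illustration, not the paper's model): the learner prefers the candidate object that matches the novel word but is not already well covered by a familiar name.

```python
import numpy as np

def me_choice(novel_word_sim: np.ndarray, familiarity: np.ndarray) -> int:
    """novel_word_sim[i]: similarity of the novel word to object i.
    familiarity[i]: how strongly object i is already named by a known word.
    An ME-biased learner picks a plausible but not-yet-named object."""
    return int(np.argmax(novel_word_sim - familiarity))

# Object 0 is a familiar "ball"; object 1 is unseen. The novel word maps to 1.
print(me_choice(np.array([0.5, 0.5]), np.array([0.9, 0.0])))  # -> 1
```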
The remarkable generative capabilities of denoising diffusion models have raised new concerns regarding the authenticity of the images we see every day on the Internet. However, the vast majority of existing deepfake detection models are tested again…
External link:
http://arxiv.org/abs/2311.04584
Towards generalisable and calibrated synthetic speech detection with self-supervised representations
Generalisation -- the ability of a model to perform well on unseen data -- is crucial for building reliable deepfake detectors. However, recent studies have shown that current audio deepfake models fall short of this desideratum. In this work we… (a sketch of one standard calibration metric follows below)
External link:
http://arxiv.org/abs/2309.05384
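Calibration here means that predicted confidences match empirical accuracy. One standard way to quantify it (not necessarily the measure used in the paper) is the expected calibration error (ECE), sketched below.

```python
import numpy as np

def ece(probs: np.ndarray, labels: np.ndarray, n_bins: int = 10) -> float:
    """Expected calibration error with equal-width confidence bins.
    probs: predicted probability of the positive class; labels: 0/1."""
    bin_ids = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # weight each bin's |accuracy - confidence| gap by its mass
            total += mask.mean() * abs(labels[mask].mean() - probs[mask].mean())
    return total
```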
We propose a visually grounded speech model that learns new words and their visual depictions from just a few word-image example pairs. Given a set of test images and a spoken query, we ask the model which image depicts the query word. Previous work… (a sketch of the test-time task follows below)
External link:
http://arxiv.org/abs/2306.11371
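A hypothetical evaluation loop for the test-time task described above: for each spoken query, pick the test image whose embedding is most similar to the query embedding. The shared embedding space is assumed, and the embedding arrays stand in for the paper's model.

```python
import numpy as np

def few_shot_accuracy(query_embs: np.ndarray, image_embs: np.ndarray,
                      answers: np.ndarray) -> float:
    """query_embs: (Q, D) speech embeddings; image_embs: (N, D) test images;
    answers[i]: index of the image that depicts query i."""
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    correct = 0
    for q_emb, gold in zip(query_embs, answers):
        q = q_emb / np.linalg.norm(q_emb)
        correct += int(np.argmax(imgs @ q) == gold)
    return correct / len(answers)
```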
Most vision-and-language pretraining research focuses on English tasks. However, the creation of multilingual multimodal evaluation datasets (e.g. Multi30K, xGQA, XVNLI, and MaRVL) poses a new challenge in finding high-quality training data that is…
External link:
http://arxiv.org/abs/2210.13134
Visually grounded speech (VGS) models are trained on images paired with unlabelled spoken captions. Such models could be used to build speech systems in settings where it is impossible to get labelled data, e.g. for documenting unwritten languages… (a sketch of a typical training objective follows below)
External link:
http://arxiv.org/abs/2210.04600
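VGS models of this kind are commonly trained with a contrastive objective that pulls paired image and speech embeddings together. Below is an InfoNCE-style sketch of that idea; the paper's exact loss and architecture may differ.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img: torch.Tensor, speech: torch.Tensor,
                     temp: float = 0.07) -> torch.Tensor:
    """img, speech: (B, D) L2-normalised embeddings of paired examples."""
    logits = img @ speech.t() / temp     # (B, B) similarity matrix
    targets = torch.arange(img.size(0))  # matched pairs lie on the diagonal
    # symmetric cross-entropy: image-to-speech and speech-to-image retrieval
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```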
Published in:
Sensors. 2022; 22(11):4104
The task of converting text input into video content is becoming an important topic for synthetic media generation. Several methods have been proposed, some of them reaching close-to-natural performance in constrained tasks. In this paper, we…
External link:
http://arxiv.org/abs/2206.03206