Showing results 1 - 10 of 158 for the search query: "Räsänen, Okko"
Audio-text relevance learning refers to learning the shared semantic properties of audio samples and textual descriptions. The standard approach uses binary relevances derived from pairs of audio samples and their human-provided captions, categorizing… (see the sketch after this entry)
External link:
http://arxiv.org/abs/2408.14939
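The binary setup mentioned in the entry above can be made concrete with a small sketch: each caption is treated as relevant only to the audio clip it was written for and irrelevant to every other clip. The data and variable names below are illustrative assumptions, not taken from the paper.

    # Minimal sketch of the binary relevance assumption: a caption counts as
    # relevant (1) only to the audio clip it describes, irrelevant (0) otherwise.
    # All data here is made up for illustration.
    import numpy as np

    pairs = [
        ("clip_001", "a dog barks while cars pass by"),
        ("clip_001", "barking dog near a busy road"),
        ("clip_002", "rain falling on a metal roof"),
    ]

    audio_ids = sorted({audio for audio, _ in pairs})
    relevance = np.zeros((len(audio_ids), len(pairs)), dtype=np.int8)
    for j, (audio, _) in enumerate(pairs):
        relevance[audio_ids.index(audio), j] = 1

    print(relevance)
    # [[1 1 0]
    #  [0 0 1]]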
Authors:
Khorrami, Khazar, Räsänen, Okko
Infants gradually learn to parse continuous speech into words and connect names with objects, yet the mechanisms behind the development of early word perception skills remain unknown. We studied the extent to which early words can be acquired through sta…
External link:
http://arxiv.org/abs/2406.05259
Authors:
Räsänen, Okko, Kocharov, Daniil
Child-directed speech (CDS) is a particular type of speech that adults use when addressing young children. Its properties also change as a function of extralinguistic factors, such as the age of the child being addressed. Access to large amounts of repre…
External link:
http://arxiv.org/abs/2405.07700
This paper explores grading text-based audio retrieval relevances with crowdsourced assessments. Given a free-form text (e.g., a caption) as a query, crowdworkers are asked to grade audio clips using numeric scores (between 0 and 100) to indicate th… (see the sketch after this entry)
External link:
http://arxiv.org/abs/2306.09820
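The 0-100 grading protocol suggests a simple aggregation step: average the scores that several crowdworkers give a query-clip pair and rescale to [0, 1] to obtain a graded relevance label. The sketch below is a generic illustration of that idea under assumed data, not the paper's aggregation procedure.

    # Generic sketch: aggregating per-worker 0-100 scores into graded
    # relevance labels by averaging and rescaling to [0, 1]. The judgments
    # and the aggregation rule are illustrative assumptions.
    from collections import defaultdict
    from statistics import mean

    judgments = [  # (query, clip, worker score)
        ("children shouting in a playground", "clip_17", 85),
        ("children shouting in a playground", "clip_17", 70),
        ("children shouting in a playground", "clip_42", 10),
        ("children shouting in a playground", "clip_42", 25),
    ]

    scores = defaultdict(list)
    for query, clip, score in judgments:
        scores[(query, clip)].append(score)

    graded = {pair: mean(vals) / 100 for pair, vals in scores.items()}
    print(graded)  # e.g. {(..., 'clip_17'): 0.775, (..., 'clip_42'): 0.175}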
Speech representation learning with self-supervised algorithms has resulted in notable performance boosts in many downstream tasks. Recent work combined self-supervised learning (SSL) and visually grounded speech (VGS) processing mechanisms for repre…
External link:
http://arxiv.org/abs/2306.02972
Authors:
Lavechin, Marvin, Sy, Yaya, Titeux, Hadrien, Blandón, María Andrea Cruz, Räsänen, Okko, Bredin, Hervé, Dupoux, Emmanuel, Cristia, Alejandrina
Self-supervised techniques for learning speech representations have been shown to develop linguistic competence from exposure to speech without the need for human labels. In order to fully realize the potential of these approaches and further our und…
External link:
http://arxiv.org/abs/2306.01506
In this paper, we show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually-grounded training objective. We demonstrate that a nearly identical model architecture (HuBERT) trained with a ma… (see the sketch after this entry)
External link:
http://arxiv.org/abs/2305.11435
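One generic way to probe whether frame-level speech representations carry syllabic structure is to look for peaks in frame-to-frame feature change and treat them as candidate boundaries. The sketch below illustrates that probe on placeholder features; it is not the segmentation method used in the paper.

    # Generic probe (not the paper's method): peaks in frame-to-frame cosine
    # dissimilarity are taken as candidate syllable boundaries. `features`
    # would normally come from a speech model such as (VG-)HuBERT; here it
    # is a random placeholder.
    import numpy as np

    rng = np.random.default_rng(0)
    features = rng.normal(size=(200, 768))          # (frames, feature dim)

    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    dissim = 1.0 - np.sum(normed[1:] * normed[:-1], axis=1)

    threshold = dissim.mean() + dissim.std()
    boundaries = [
        t + 1
        for t in range(1, len(dissim) - 1)
        if dissim[t] > threshold
        and dissim[t] >= dissim[t - 1]
        and dissim[t] >= dissim[t + 1]
    ]
    print(boundaries[:10])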
The recently developed infant wearable MAIJU provides a means to automatically evaluate infants' motor performance in an objective and scalable manner in out-of-hospital settings. This information could be used for developmental research and to suppo…
External link:
http://arxiv.org/abs/2305.09366
Modelling of early language acquisition aims to understand how infants bootstrap their language skills. The modelling encompasses properties of the input data used for training the models, the cognitive hypotheses and their algorithmic implementation…
External link:
http://arxiv.org/abs/2305.01965
This paper investigates negative sampling for contrastive learning in the context of audio-text retrieval. The strategy for negative sampling refers to selecting negatives (either audio clips or textual descriptions) from a pool of candidates for a p… (see the sketch after this entry)
External link:
http://arxiv.org/abs/2211.04070
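Negative sampling here means choosing, for each positive audio-caption pair, the non-matching items it is contrasted against. The sketch below compares two generic strategies, uniform random negatives and "hard" negatives ranked by similarity under the current embeddings; the embeddings and names are placeholders, not the paper's implementation.

    # Two generic negative-sampling strategies for a positive audio-caption
    # pair: uniform random negatives vs. hard negatives picked by similarity
    # under the current embeddings. Embeddings are random placeholders.
    import numpy as np

    rng = np.random.default_rng(0)
    n_items, dim, n_neg = 100, 32, 5
    audio_emb = rng.normal(size=(n_items, dim))
    text_emb = rng.normal(size=(n_items, dim))      # text_emb[i] describes audio i

    def random_negatives(pos_idx):
        candidates = np.delete(np.arange(n_items), pos_idx)
        return rng.choice(candidates, size=n_neg, replace=False)

    def hard_negatives(pos_idx):
        # captions currently most similar to the positive audio, true caption excluded
        sims = text_emb @ audio_emb[pos_idx]
        sims[pos_idx] = -np.inf
        return np.argsort(sims)[-n_neg:][::-1]

    print("random:", random_negatives(3))
    print("hard  :", hard_negatives(3))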