Semantic sentence similarity. Size does not always matter

Author:	Stefan L. Frank, Mirjam Ernestus, Danny Merkx
Publication year:	2021
Subject:	FOS: Computer and information sciences; Computer science; 02 engineering and technology; computer.software_genre; Semantics; Field (computer science); Image (mathematics); Language in Interaction; 030507 speech-language pathology & audiology; 03 medical and health sciences; Semantic similarity; Similarity (psychology); 0202 electrical engineering, electronic engineering, information engineering; Natural (music); Speech Production and Comprehension; Computer Science - Computation and Language; business.industry; SIGNAL (programming language); 020206 networking & telecommunications; 16. Peace & justice; Language & Communication; Artificial intelligence; Grammar & Cognition; Language & Speech Technology; 0305 other medical science; business; computer; Computation and Language (cs.CL); Sentence; Natural language processing
Source:	Proceedings of Interspeech 2021, pp. 4393-4397. [S.l.]: ISCA
Description:	This study addresses the question of whether visually grounded speech recognition (VGS) models learn to capture sentence semantics without access to any prior linguistic knowledge. We produce synthetic and natural spoken versions of a well-known semantic textual similarity database and show that our VGS model produces embeddings that correlate well with human semantic similarity judgements. Our results show that a model trained on a small image-caption database outperforms two models trained on much larger databases, indicating that database size is not all that matters. We also investigate the importance of having multiple captions per image and find that this is indeed helpful even if the total number of images is lower, suggesting that paraphrasing is a valuable learning signal. While the general trend in the field is to create ever larger datasets to train models on, our findings indicate that other characteristics of the database can be just as important. Comment: This paper has been accepted at Interspeech 2021, where it will be presented and appear in the conference proceedings in September 2021.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::819ca8d0e6082bcc41da661980f4aa11