Automated Speech Act Annotation in a Russian Spoken Corpus Using Large Language Models: A Comparative Study

Authors: Tatiana Sherstinova, Viktoria Firsanova, Alena Novoseltseva, Mariya Megre, Egor Savchenko
Language: English
Year of publication: 2024
Subject:
Source: Proceedings of the XXth Conference of Open Innovations Association FRUCT, Vol 36, Iss 2, Pp 912-920 (2024)
Document type: article
ISSN: 2305-7254, 2343-0737
DOI: 10.5281/zenodo.14166352
Description: The research focuses on the automatic annotation of a linguistic corpus using large language models (LLMs). Annotating a corpus is a crucial step in its creation, as it determines the practical scope and applications of the resource being developed. This study explores the annotation of oral speech transcripts at the pragmatic level using speech acts, which reflect the speaker's intent and purpose. Typically, this task is performed manually by experts, which greatly limits the volume of annotated data that can be produced. In this work, an attempt was made to annotate speech acts automatically using five LLMs commonly used for processing Russian texts: ChatGPT, GigaCHAT, YandexGPT, Mistral, and Gemini. A comparative analysis of the automatic annotation results was conducted, highlighting the strengths and weaknesses of each model. The findings suggest that employing LLMs for corpus annotation is a promising approach, with ChatGPT and Gemini demonstrating particular effectiveness in speech act categorization. However, for Russian, language-specific models such as GigaCHAT and YandexGPT are preferred when language-specific information is needed.
Database: Directory of Open Access Journals