Treebanking user-generated content: a proposal for a unified representation in universal dependencies

Author: Sanguinetti, M., Bosco, C., Cassidy, L., Çetinoglu, Ö., Cignarella, A. T., Lynn, T., Rehbein, I., Ruppenhofer, J., Seddah, D., Zeldes, A.
Language: English
Year of publication: 2020
Subject:
Source: Sanguinetti, Manuela ORCID: 0000-0002-0147-2208, Bosco, Cristina, Cassidy, Lauren, Çetinoglu, Özlem, Cignarella, Alessandra Teresa ORCID: 0000-0002-4409-6679, Lynn, Teresa, Rehbein, Ines, Ruppenhofer, Josef, Seddah, Djamé and Zeldes, Amir ORCID: 0000-0001-8016-6753 (2020) Treebanking user-generated content: a proposal for a unified representation in universal dependencies. In: 12th Language Resources and Evaluation Conference (LREC 2020), 11-16 May 2020, Marseille, France. (Virtual).
Scopus-Elsevier
Description: The paper presents a discussion on the main linguistic phenomena of user-generated texts found in web and social media, and proposes a set of annotation guidelines for their treatment within the Universal Dependencies (UD) framework. Given on the one hand the increasing number of treebanks featuring user-generated content, and its somewhat inconsistent treatment in these resources on the other, the aim of this paper is twofold: (1) to provide a short, though comprehensive, overview of such treebanks - based on available literature - along with their main features and a comparative analysis of their annotation criteria, and (2) to propose a set of tentative UD-based annotation guidelines, to promote consistent treatment of the particular phenomena found in these types of texts. The main goal of this paper is to provide a common framework for those teams interested in developing similar resources in UD, thus enabling cross-linguistic consistency, which is a principle that has always been in the spirit of UD.
Database: OpenAIRE