Including Social Media – A Very Dynamic Style – in the Corpora for Processing Romanian Language

Authors: Radu Simionescu, Cătălina Mărănduc, Cenel-Augusto Perez
Year of publication: 2016
Subject:
Source: Communications in Computer and Information Science ISBN: 9783319329413
Description: This paper describes the process of introducing a new sub-corpus, in a new style (social media), into our UAIC-Ro-Dependency-Treebank. Our purpose is to enlarge the corpus and to cover all the styles of the language. Unfortunately, the growth of the corpus is interrelated with the development of the syntactic parser. Covering all the styles is a very difficult target: when parsing texts in a style for which the tools are not yet trained, the accuracy drops significantly. At least 1,000 sentences are needed for the first step of training the parser in a new style. We describe this first step, which involves introducing the social media style into the Treebank; we present a first series of orthographic, stylistic, pragmatic, lexical, semantic, syntactic, and discursive observations on this style of the language; and we report the first statistical evaluation of the automatic annotation.
Database: OpenAIRE