Distributed Text Representations Using Transformers for Noisy Written Language

Authors: Alejandro Rodriguez, Pablo Rivas, Gissella Bejarano
Year of publication: 2022
Source: LatinX in AI at North American Chapter of the Association for Computational Linguistics Conference 2022.
Description: This work proposes a methodology for deriving latent representations of highly noisy text. Natural Language Processing systems have traditionally relied on words as the core components of a text. In contrast, we propose a character-based approach that is robust to the high syntactic noise of our target texts. We pre-train a Transformer model (BERT) on several general-purpose language tasks and use the pre-trained model to obtain a representation for an input text, transferring weights from one task in the pipeline to the next. Instead of tokenizing the text on a word or sub-word basis, we treat the text’s characters as tokens. The ultimate goal is for the resulting representations to prove useful for other downstream tasks on the data, such as detecting criminal activity on marketplace platforms.
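The character-level idea is easiest to see in code. Below is a minimal sketch, not the authors' implementation: a character tokenizer feeding a small Transformer encoder whose pooled output serves as the text representation. The alphabet, model sizes, learned positional embeddings, and mean-pooling are illustrative assumptions, with PyTorch's nn.TransformerEncoder standing in for the paper's BERT model.

```python
# Illustrative sketch of character-level tokenization + Transformer
# encoding. Hyperparameters and vocabulary are assumptions, not the
# paper's actual configuration.
import torch
import torch.nn as nn

class CharTokenizer:
    """Maps raw characters to integer ids; unknown characters share one id."""
    def __init__(self, alphabet: str):
        self.pad_id, self.unk_id = 0, 1
        self.vocab = {ch: i + 2 for i, ch in enumerate(alphabet)}

    def encode(self, text: str, max_len: int = 64) -> torch.Tensor:
        ids = [self.vocab.get(ch, self.unk_id) for ch in text[:max_len]]
        ids += [self.pad_id] * (max_len - len(ids))  # right-pad to max_len
        return torch.tensor(ids)

class CharEncoder(nn.Module):
    """Embeds character ids, adds learned positions, runs a small
    Transformer encoder, and mean-pools over non-pad positions."""
    def __init__(self, vocab_size: int, d_model: int = 128, max_len: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(ids.size(1), device=ids.device)
        pad_mask = ids.eq(0)                       # True at padding slots
        h = self.encoder(self.embed(ids) + self.pos(positions),
                         src_key_padding_mask=pad_mask)
        h = h.masked_fill(pad_mask.unsqueeze(-1), 0.0)
        lengths = (~pad_mask).sum(dim=1, keepdim=True).clamp(min=1)
        return h.sum(dim=1) / lengths              # mean over real characters

tok = CharTokenizer("abcdefghijklmnopqrstuvwxyz0123456789 $#@!.")
model = CharEncoder(vocab_size=len(tok.vocab) + 2)
# Noisy text with deliberate misspellings survives tokenization intact.
ids = tok.encode("ch3ap w@tches 4 sale!!").unsqueeze(0)  # shape (1, 64)
rep = model(ids)                                          # shape (1, 128)
print(rep.shape)
```

Because tokenization happens at the character level, obfuscated spellings like "ch3ap w@tches" map onto known tokens rather than out-of-vocabulary words, which is the robustness to syntactic noise the abstract targets.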
Database: OpenAIRE