Showing 1 - 10
of 33
for search: '"Gutiérrez Fandiño, Asier"'
Author:
Gutiérrez-Fandiño, Asier, Pérez-Fernández, David, Armengol-Estapé, Jordi, Griol, David, Callejas, Zoraida
In recent years, transformer-based models have led to significant advances in language modelling for natural language processing. However, they require a vast amount of data to be (pre-)trained, and there is a lack of corpora in languages other …
External link:
http://arxiv.org/abs/2206.15147
The Large Labelled Logo Dataset (L3D): A Multipurpose and Hand-Labelled Continuously Growing Dataset
In this work, we present the Large Labelled Logo Dataset (L3D), a multipurpose, hand-labelled, continuously growing dataset. It is composed of around 770k color 256x256 RGB images extracted from the European Union Intellectual Property Office (EUIPO) …
External link:
http://arxiv.org/abs/2112.05404
We introduce a new language representation model in finance called Financial Embedding Analysis of Sentiment (FinEAS). In financial markets, news and investor sentiment are significant drivers of security prices. Thus, leveraging the capabilities of …
External link:
http://arxiv.org/abs/2111.00526
There are many language models for English, owing to its worldwide relevance. However, for Spanish, even though it is a widely spoken language, there are very few Spanish language models, and those tend to be small and too …
External link:
http://arxiv.org/abs/2110.12201
Author:
Carrino, Casimiro Pio, Armengol-Estapé, Jordi, Bonet, Ona de Gibert, Gutiérrez-Fandiño, Asier, Gonzalez-Agirre, Aitor, Krallinger, Martin, Villegas, Marta
We introduce CoWeSe (the Corpus Web Salud Español), the largest Spanish biomedical corpus to date, consisting of 4.5GB (about 750M tokens) of clean plain text. CoWeSe is the result of a massive crawl of 3000 Spanish domains executed in 2020. …
External link:
http://arxiv.org/abs/2109.07765
Author:
Carrino, Casimiro Pio, Armengol-Estapé, Jordi, Gutiérrez-Fandiño, Asier, Llop-Palao, Joan, Pàmies, Marc, Gonzalez-Agirre, Aitor, Villegas, Marta
This work presents biomedical and clinical language models for Spanish, built by experimenting with different pretraining choices, such as masking at the word and subword level, varying the vocabulary size, and testing with domain data, looking for better …
External link:
http://arxiv.org/abs/2109.03570
Author:
Gutiérrez-Fandiño, Asier, Armengol-Estapé, Jordi, Pàmies, Marc, Llop-Palao, Joan, Silveira-Ocampo, Joaquín, Carrino, Casimiro Pio, Gonzalez-Agirre, Aitor, Armentano-Oller, Carme, Rodriguez-Penagos, Carlos, Villegas, Marta
Published in:
Procesamiento del Lenguaje Natural, vol. 68, pp. 39-60, Mar. 2022. ISSN 1989-7553
This work presents MarIA, a family of Spanish language models and associated resources made available to the industry and the research community. Currently, MarIA includes RoBERTa-base, RoBERTa-large, GPT2 and GPT2-large Spanish language models, which …
External link:
http://arxiv.org/abs/2107.07253
The training of neural networks is usually monitored with a validation (holdout) set to estimate the generalization of the model. This is done instead of measuring intrinsic properties of the model to determine whether it is learning appropriately. …
External link:
http://arxiv.org/abs/2106.00012
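The holdout-monitoring practice this abstract contrasts against can be sketched with a toy early-stopping loop. Everything here is illustrative and hypothetical (a one-parameter linear model, synthetic data, arbitrary patience and tolerance); it is not the paper's method, only the conventional baseline it argues against:

```python
import random

random.seed(0)

# Synthetic data: y = 3x + noise, split into train and validation (holdout) sets.
data = [(i / 100, 3 * i / 100 + random.gauss(0, 0.1)) for i in range(100)]
random.shuffle(data)
train, val = data[:80], data[80:]

def mse(w, pairs):
    """Mean squared error of the one-parameter model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in pairs) / len(pairs)

w = 0.0          # single trainable parameter
lr = 0.5         # learning rate (illustrative)
patience = 5     # epochs without validation improvement before stopping
best_val, best_w, bad_epochs = float("inf"), w, 0

for epoch in range(200):
    # Gradient of train MSE w.r.t. w, then one gradient-descent step.
    grad = sum(2 * (w * x - y) * x for x, y in train) / len(train)
    w -= lr * grad
    v = mse(w, val)
    if v < best_val - 1e-6:
        best_val, best_w, bad_epochs = v, w, 0   # validation improved
    else:
        bad_epochs += 1
        if bad_epochs >= patience:               # generalization stopped improving
            break

print(round(best_w, 2))  # recovered slope, close to the true value 3
```

The validation set never contributes gradients; it only decides when to stop, which is exactly the external-monitoring setup the paper proposes replacing with intrinsic model properties.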
Author:
Gutiérrez-Fandiño, Asier, Armengol-Estapé, Jordi, Carrino, Casimiro Pio, De Gibert, Ona, Gonzalez-Agirre, Aitor, Villegas, Marta
We computed both word and sub-word embeddings using FastText. For the sub-word embeddings we selected the Byte Pair Encoding (BPE) algorithm to represent the sub-words. We evaluated the biomedical word embeddings, obtaining better results than previous …
External link:
http://arxiv.org/abs/2102.12843
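The BPE sub-word choice mentioned in this abstract can be illustrated with a minimal, self-contained sketch of the classic byte-pair-encoding merge procedure (Sennrich-style). The toy corpus and the number of merges are hypothetical; the paper's actual pipeline trains FastText embeddings over a real biomedical corpus and is not reproduced here:

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count frequencies of adjacent symbol pairs across the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the given symbol pair into one symbol."""
    merged = {}
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    for word, freq in vocab.items():
        merged[pattern.sub("".join(pair), word)] = freq
    return merged

# Toy biomedical-flavoured corpus (hypothetical, for illustration only).
corpus = ["protein", "proteins", "proteomic", "genome", "genomic"]
vocab = Counter(" ".join(word) for word in corpus)  # words as space-separated chars

merges = []
for _ in range(5):  # the number of merge operations is a hyperparameter
    stats = get_pair_stats(vocab)
    if not stats:
        break
    best = max(stats, key=stats.get)
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)
```

Each learned merge becomes a sub-word unit; the resulting BPE segments are what an embedding model would then represent, instead of whole words.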
Characterizing the structural properties of neural networks is crucial yet poorly understood, and there are no well-established similarity measures between networks. In this work, we observe that neural networks can be represented as abstract simplicial complexes …
External link:
http://arxiv.org/abs/2101.07752