Showing 1 - 10 of 409 for search: '"Suarez, Pedro"'
Author:
Brandizzi, Nicolo', Abdelwahab, Hammam, Bhowmick, Anirban, Helmer, Lennard, Stein, Benny Jörg, Denisov, Pavel, Saleem, Qasid, Fromm, Michael, Ali, Mehdi, Rutmann, Richard, Naderi, Farzad, Agy, Mohamad Saif, Schwirjow, Alexander, Küch, Fabian, Hahn, Luzian, Ostendorff, Malte, Suarez, Pedro Ortiz, Rehm, Georg, Wegener, Dennis, Flores-Herr, Nicolas, Köhler, Joachim, Leveling, Johannes
This paper presents a comprehensive overview of the data preparation pipeline developed for the OpenGPT-X project, a large-scale initiative aimed at creating open and high-performance multilingual large language models (LLMs). The project goal is to…
External link:
http://arxiv.org/abs/2410.08800
Author:
Ali, Mehdi, Fromm, Michael, Thellmann, Klaudia, Ebert, Jan, Weber, Alexander Arno, Rutmann, Richard, Jain, Charvi, Lübbering, Max, Steinigen, Daniel, Leveling, Johannes, Klug, Katrin, Buschhoff, Jasper Schulze, Jurkschat, Lena, Abdelwahab, Hammam, Stein, Benny Jörg, Sylla, Karl-Heinz, Denisov, Pavel, Brandizzi, Nicolo', Saleem, Qasid, Bhowmick, Anirban, Helmer, Lennard, John, Chelsea, Suarez, Pedro Ortiz, Ostendorff, Malte, Jude, Alex, Manjunath, Lalith, Weinbach, Samuel, Penke, Carolin, Filatov, Oleg, Asaadi, Shima, Barth, Fabio, Sifa, Rafet, Küch, Fabian, Herten, Andreas, Jäkel, René, Rehm, Georg, Kesselheim, Stefan, Köhler, Joachim, Flores-Herr, Nicolas
We present two multilingual LLMs designed to embrace Europe's linguistic diversity by supporting all 24 official languages of the European Union. Trained on a dataset comprising around 60% non-English data and utilizing a custom multilingual tokenize…
External link:
http://arxiv.org/abs/2410.03730
Whether or not several Creole languages which developed during the early modern period can be considered genetic descendants of European languages has been the subject of intense debate. This is in large part due to the absence of evidence of interme…
External link:
http://arxiv.org/abs/2408.04554
Author:
Futeral, Matthieu, Zebaze, Armel, Suarez, Pedro Ortiz, Abadji, Julien, Lacroix, Rémi, Schmid, Cordelia, Bawden, Rachel, Sagot, Benoît
Multimodal Large Language Models (mLLMs) are trained on a large amount of text-image data. While most mLLMs are trained on caption-like data only, Alayrac et al. [2022] showed that additionally training them on interleaved sequences of text and image…
External link:
http://arxiv.org/abs/2406.08707
Author:
Ali, Mehdi, Fromm, Michael, Thellmann, Klaudia, Rutmann, Richard, Lübbering, Max, Leveling, Johannes, Klug, Katrin, Ebert, Jan, Doll, Niclas, Buschhoff, Jasper Schulze, Jain, Charvi, Weber, Alexander Arno, Jurkschat, Lena, Abdelwahab, Hammam, John, Chelsea, Suarez, Pedro Ortiz, Ostendorff, Malte, Weinbach, Samuel, Sifa, Rafet, Kesselheim, Stefan, Flores-Herr, Nicolas
The recent success of Large Language Models (LLMs) has been predominantly driven by curating the training dataset composition, scaling of model architectures and dataset sizes and advancements in pretraining objectives, leaving tokenizer influence as…
External link:
http://arxiv.org/abs/2310.08754
Author:
Foppiano, Luca, Mato, Tomoya, Terashima, Kensei, Suarez, Pedro Ortiz, Tou, Taku, Sakai, Chikako, Wang, Wei-Sheng, Amagasa, Toshiyuki, Takano, Yoshihiko, Ishii, Masashi
We propose a semi-automatic staging area for efficiently building an accurate database of experimental physical properties of superconductors from literature, called SuperCon2, to enrich the existing manually-built superconductor database SuperCon…
External link:
http://arxiv.org/abs/2309.10923
Author:
Suarez, Pedro Enrique Navas
Neotropical deer have important functions for the sustainability and health of their ecosystem. The investigation of pathological processes and the respective etiologic agents is a critical part of the conservation programs involving these species…
Author:
Laurençon, Hugo, Saulnier, Lucile, Wang, Thomas, Akiki, Christopher, del Moral, Albert Villanova, Scao, Teven Le, Von Werra, Leandro, Mou, Chenghao, Ponferrada, Eduardo González, Nguyen, Huu, Frohberg, Jörg, Šaško, Mario, Lhoest, Quentin, McMillan-Major, Angelina, Dupont, Gerard, Biderman, Stella, Rogers, Anna, allal, Loubna Ben, De Toni, Francesco, Pistilli, Giada, Nguyen, Olivier, Nikpoor, Somaieh, Masoud, Maraim, Colombo, Pierre, de la Rosa, Javier, Villegas, Paulo, Thrush, Tristan, Longpre, Shayne, Nagel, Sebastian, Weber, Leon, Muñoz, Manuel, Zhu, Jian, Van Strien, Daniel, Alyafeai, Zaid, Almubarak, Khalid, Vu, Minh Chien, Gonzalez-Dios, Itziar, Soroa, Aitor, Lo, Kyle, Dey, Manan, Suarez, Pedro Ortiz, Gokaslan, Aaron, Bose, Shamik, Adelani, David, Phan, Long, Tran, Hieu, Yu, Ian, Pai, Suhas, Chim, Jenny, Lepercq, Violette, Ilic, Suzana, Mitchell, Margaret, Luccioni, Sasha Alexandra, Jernite, Yacine
As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed…
External link:
http://arxiv.org/abs/2303.03915