Showing 1 - 10 of 96 for search: "Sagot, Benoit"
Author:
Clérice, Thibault, Janes, Juliette, Scheithauer, Hugo, Bénière, Sarah, Cafiero, Florian, Romary, Laurent, Gabay, Simon, Sagot, Benoît
We present a novel, open-access dataset designed for semantic layout analysis, built to support document recreation workflows through mapping with the Text Encoding Initiative (TEI) standard. This dataset includes 7,254 annotated pages spanning a lar…
External link:
http://arxiv.org/abs/2411.10068
Author:
Antoun, Wissam, Kulumba, Francis, Touchent, Rian, de la Clergerie, Éric, Sagot, Benoît, Seddah, Djamé
French language models, such as CamemBERT, have been widely adopted across industries for natural language processing (NLP) tasks, with models like CamemBERT seeing over 4 million downloads per month. However, these models face challenges due to temp…
External link:
http://arxiv.org/abs/2411.08868
Large Language Models (LLMs) have demonstrated remarkable performance across multiple tasks through in-context learning. For complex reasoning tasks that require step-by-step thinking, Chain-of-Thought (CoT) prompting has given impressive results, es…
External link:
http://arxiv.org/abs/2410.06634
Whether or not several Creole languages which developed during the early modern period can be considered genetic descendants of European languages has been the subject of intense debate. This is in large part due to the absence of evidence of interme…
External link:
http://arxiv.org/abs/2408.04554
The ability of generative large language models (LLMs) to perform in-context learning has given rise to a large body of research into how best to prompt models for various natural language processing tasks. In this paper, we focus on machine translat…
External link:
http://arxiv.org/abs/2408.00397
Current multimodal machine translation (MMT) systems rely on fully supervised data (i.e., models are trained on sentences with their translations and accompanying images). However, this type of data is costly to collect, limiting the extension of MMT t…
External link:
http://arxiv.org/abs/2407.13579
Author:
Futeral, Matthieu, Zebaze, Armel, Suarez, Pedro Ortiz, Abadji, Julien, Lacroix, Rémi, Schmid, Cordelia, Bawden, Rachel, Sagot, Benoît
Multimodal Large Language Models (mLLMs) are trained on a large amount of text-image data. While most mLLMs are trained on caption-like data only, Alayrac et al. [2022] showed that additionally training them on interleaved sequences of text and image…
External link:
http://arxiv.org/abs/2406.08707
Published in:
NAACL 2024 - 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Jun 2024, Mexico City, Mexico
In this work, we introduce a comprehensive error typology specifically designed for evaluating two distinct tasks in machine-generated patent texts: claims-to-abstract generation, and the generation of the next claim given previous ones. We have also…
External link:
http://arxiv.org/abs/2406.06589
Recent advances in language modeling consist of pretraining highly parameterized neural networks on extremely large web-mined text corpora. Training and inference with such models can be costly in practice, which incentivizes the use of smaller count…
External link:
http://arxiv.org/abs/2404.07647
NLP models have been known to perform poorly on user-generated content (UGC), mainly because it presents many lexical variations and deviates from the standard texts on which most of these models were trained. In this work, we focus on the robust…
External link:
http://arxiv.org/abs/2403.17220