Showing 1 - 10 of 80 for search: '"LJUBEŠIČ, Nikola"'
Published in:
Vandeghinste, V., & Kontino, T. (2024). CLARIN Annual Conference Proceedings 2024
This paper introduces the CLASSLA-Express workshop series as an innovative approach to disseminating linguistic resources and infrastructure provided by the CLASSLA Knowledge Centre for South Slavic languages and the Slovenian CLARIN.SI infrastructur
External link:
http://arxiv.org/abs/2412.01386
Author:
Kuzman, Taja, Ljubešić, Nikola
With the ever-increasing number of news stories available online, classifying them by topic, regardless of the language they are written in, has become crucial for enhancing readers' access to relevant content. To address this challenge, we propose a
External link:
http://arxiv.org/abs/2411.19638
Recent significant improvements in speech and language technologies come both from self-supervised approaches over raw language data as well as various types of explicit supervision. To ensure high-quality processing of spoken data, the most useful t
External link:
http://arxiv.org/abs/2409.15397
Author:
Çöltekin, Çağrı, Kopp, Matyáš, Meden, Katja, Morkevicius, Vaidas, Ljubešić, Nikola, Erjavec, Tomaž
We introduce a dataset on political orientation and power position identification. The dataset is derived from ParlaMint, a set of comparable corpora of transcribed parliamentary speeches from 29 national and regional parliaments. We introduce the da
External link:
http://arxiv.org/abs/2405.07363
The world of language models is going through turbulent times, better and ever larger models are coming out at an unprecedented speed. However, we argue that, especially for the scientific community, encoder models of up to 1 billion parameters are s
External link:
http://arxiv.org/abs/2404.05428
Author:
Ljubešić, Nikola, Kuzman, Taja
This paper presents a collection of highly comparable web corpora of Slovenian, Croatian, Bosnian, Montenegrin, Serbian, Macedonian, and Bulgarian, covering thereby the whole spectrum of official languages in the South Slavic language space. The coll
External link:
http://arxiv.org/abs/2403.12721
Author:
van Noord, Rik, Kuzman, Taja, Rupnik, Peter, Ljubešić, Nikola, Esplà-Gomis, Miquel, Ramírez-Sánchez, Gema, Toral, Antonio
Large, curated, web-crawled corpora play a vital role in training language models (LMs). They form the lion's share of the training data in virtually all recent LMs, such as the well-known GPT, LLaMA and XLM-RoBERTa models. However, despite this impo
External link:
http://arxiv.org/abs/2403.08693
Author:
Mayhew, Stephen, Blevins, Terra, Liu, Shuheng, Šuppa, Marek, Gonen, Hila, Imperial, Joseph Marvin, Karlsson, Börje F., Lin, Peiqin, Ljubešić, Nikola, Miranda, LJ, Plank, Barbara, Riabi, Arij, Pinter, Yuval
We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages. The overarching goal of UNER is to provide high-quality, cross-lingually consistent annotations to facilitate and standard
External link:
http://arxiv.org/abs/2311.09122
The paper presents a new training dataset of sentences in 7 languages, manually annotated for sentiment, which are used in a series of experiments focused on training a robust sentiment identifier for parliamentary proceedings. The paper additionally
External link:
http://arxiv.org/abs/2309.09783
Author:
Terčon, Luka, Ljubešić, Nikola
We present CLASSLA-Stanza, a pipeline for automatic linguistic annotation of the South Slavic languages, which is based on the Stanza natural language processing pipeline. We describe the main improvements in CLASSLA-Stanza with respect to Stanza, an
External link:
http://arxiv.org/abs/2308.04255