Search results - "LJUBEŠIČ, Nikola"

Report

CLASSLA-Express: a Train of CLARIN.SI Workshops on Language Resources and Tools with Easily Expanding Route

Authors: Ljubešić, Nikola; Kuzman, Taja; Petrović, Ivana Filipović; Parizoska, Jelena; Osenova, Petya

Published in: Vandeghinste, V., & Kontino, T. (2024). CLARIN Annual Conference Proceedings 2024

This paper introduces the CLASSLA-Express workshop series as an innovative approach to disseminating linguistic resources and infrastructure provided by the CLASSLA Knowledge Centre for South Slavic languages and the Slovenian CLARIN.SI infrastructur

External link: http://arxiv.org/abs/2412.01386

Report

LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification

Authors: Kuzman, Taja; Ljubešić, Nikola

With the ever-increasing number of news stories available online, classifying them by topic, regardless of the language they are written in, has become crucial for enhancing readers' access to relevant content. To address this challenge, we propose a

External link: http://arxiv.org/abs/2411.19638

Report

The ParlaSpeech Collection of Automatically Generated Speech and Text Datasets from Parliamentary Proceedings

Authors: Ljubešić, Nikola; Rupnik, Peter; Koržinek, Danijel

Recent significant improvements in speech and language technologies come both from self-supervised approaches over raw language data as well as various types of explicit supervision. To ensure high-quality processing of spoken data, the most useful t

External link: http://arxiv.org/abs/2409.15397

Report

Multilingual Power and Ideology Identification in the Parliament: a Reference Dataset and Simple Baselines

Authors: Çöltekin, Çağrı; Kopp, Matyáš; Meden, Katja; Morkevicius, Vaidas; Ljubešić, Nikola; Erjavec, Tomaž

We introduce a dataset on political orientation and power position identification. The dataset is derived from ParlaMint, a set of comparable corpora of transcribed parliamentary speeches from 29 national and regional parliaments. We introduce the da

External link: http://arxiv.org/abs/2405.07363

Report

Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining

Authors: Ljubešić, Nikola; Suchomel, Vít; Rupnik, Peter; Kuzman, Taja; van Noord, Rik

The world of language models is going through turbulent times, better and ever larger models are coming out at an unprecedented speed. However, we argue that, especially for the scientific community, encoder models of up to 1 billion parameters are s

External link: http://arxiv.org/abs/2404.05428

Report

CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation

Authors: Ljubešić, Nikola; Kuzman, Taja

This paper presents a collection of highly comparable web corpora of Slovenian, Croatian, Bosnian, Montenegrin, Serbian, Macedonian, and Bulgarian, covering thereby the whole spectrum of official languages in the South Slavic language space. The coll

External link: http://arxiv.org/abs/2403.12721

Report

Do Language Models Care About Text Quality? Evaluating Web-Crawled Corpora Across 11 Languages

Authors: van Noord, Rik; Kuzman, Taja; Rupnik, Peter; Ljubešić, Nikola; Esplà-Gomis, Miquel; Ramírez-Sánchez, Gema; Toral, Antonio

Large, curated, web-crawled corpora play a vital role in training language models (LMs). They form the lion's share of the training data in virtually all recent LMs, such as the well-known GPT, LLaMA and XLM-RoBERTa models. However, despite this impo

External link: http://arxiv.org/abs/2403.08693

Report

Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark

Authors: Mayhew, Stephen; Blevins, Terra; Liu, Shuheng; Šuppa, Marek; Gonen, Hila; Imperial, Joseph Marvin; Karlsson, Börje F.; Lin, Peiqin; Ljubešić, Nikola; Miranda, LJ; Plank, Barbara; Riabi, Arij; Pinter, Yuval

We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages. The overarching goal of UNER is to provide high-quality, cross-lingually consistent annotations to facilitate and standard

External link: http://arxiv.org/abs/2311.09122

Report

The ParlaSent Multilingual Training Dataset for Sentiment Identification in Parliamentary Proceedings

Authors: Mochtak, Michal; Rupnik, Peter; Ljubešić, Nikola

The paper presents a new training dataset of sentences in 7 languages, manually annotated for sentiment, which are used in a series of experiments focused on training a robust sentiment identifier for parliamentary proceedings. The paper additionally

External link: http://arxiv.org/abs/2309.09783

Report

CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages

Authors: Terčon, Luka; Ljubešić, Nikola

We present CLASSLA-Stanza, a pipeline for automatic linguistic annotation of the South Slavic languages, which is based on the Stanza natural language processing pipeline. We describe the main improvements in CLASSLA-Stanza with respect to Stanza, an

External link: http://arxiv.org/abs/2308.04255
