Search results - "Isbister, Tim"

Report

SWEb: A Large Web Dataset for the Scandinavian Languages

Authors: Norlund, Tobias; Isbister, Tim; Gyllensten, Amaru Cuba; Santos, Paul Dos; Petrelli, Danila; Ekgren, Ariel; Sahlgren, Magnus

This paper presents the hitherto largest pretraining dataset for the Scandinavian languages: the Scandinavian WEb (SWEb), comprising over one trillion tokens. The paper details the collection and processing pipeline, and introduces a novel model-base

External link: http://arxiv.org/abs/2410.04456

Report

GPT-SW3: An Autoregressive Language Model for the Nordic Languages

Authors: Ekgren, Ariel; Gyllensten, Amaru Cuba; Stollenwerk, Felix; Öhman, Joey; Isbister, Tim; Gogoulou, Evangelia; Carlsson, Fredrik; Heiman, Alice; Casademont, Judit; Sahlgren, Magnus

This paper details the process of developing the first native large generative language model for the Nordic languages, GPT-SW3. We cover all parts of the development process, from data collection and processing, training configuration and instructio

External link: http://arxiv.org/abs/2305.12987

Report

The Nordic Pile: A 1.2TB Nordic Dataset for Language Modeling

Authors: Öhman, Joey; Verlinden, Severine; Ekgren, Ariel; Gyllensten, Amaru Cuba; Isbister, Tim; Gogoulou, Evangelia; Carlsson, Fredrik; Sahlgren, Magnus

Pre-training Large Language Models (LLMs) requires massive amounts of text data, and the performance of the LLMs typically correlates with the scale and quality of the datasets. This means that it may be challenging to build LLMs for smaller languages

External link: http://arxiv.org/abs/2303.17183

Report

Cross-lingual Transfer of Monolingual Models

Authors: Gogoulou, Evangelia; Ekgren, Ariel; Isbister, Tim; Sahlgren, Magnus

Recent studies in zero-shot cross-lingual learning using multilingual models have falsified the previous hypothesis that shared vocabulary and joint pre-training are the keys to cross-lingual generalization. Inspired by this advancement, we introduce

External link: http://arxiv.org/abs/2109.07348

Report

Should we Stop Training More Monolingual Models, and Simply Use Machine Translation Instead?

Authors: Isbister, Tim; Carlsson, Fredrik; Sahlgren, Magnus

Most work in NLP makes the assumption that it is desirable to develop solutions in the native language in question. There is consequently a strong trend towards building native language models even for low-resource languages. This paper questions thi

External link: http://arxiv.org/abs/2104.10441

Report

Why Not Simply Translate? A First Swedish Evaluation Benchmark for Semantic Similarity

Authors: Isbister, Tim; Sahlgren, Magnus

This paper presents the first Swedish evaluation benchmark for textual semantic similarity. The benchmark is compiled by simply running the English STS-B dataset through the Google machine translation API. This paper discusses potential problems with

External link: http://arxiv.org/abs/2009.03116

Report

Automatic Extraction of Personality from Text: Challenges and Opportunities

Authors: Akrami, Nazar; Fernquist, Johan; Isbister, Tim; Kaati, Lisa; Pelzer, Björn

In this study, we examined the possibility to extract personality traits from a text. We created an extensive dataset by having experts annotate personality traits in a large number of texts from multiple online sources. From these annotated texts, w

External link: http://arxiv.org/abs/1910.09916

Report

Monitoring Targeted Hate in Online Environments

Authors: Isbister, Tim; Sahlgren, Magnus; Kaati, Lisa; Obaidi, Milan; Akrami, Nazar

Hateful comments, swearwords and sometimes even death threats are becoming a reality for many people today in online environments. This is especially true for journalists, politicians, artists, and other public figures. This paper describes how hate

External link: http://arxiv.org/abs/1803.04757

Dissertation/Thesis

Anomaly detection on social media using ARIMA models

Author: Isbister, Tim

This thesis explores whether it is possible to capture communication patterns from web-forums and detect anomalous user behaviour. Data from individuals on web-forums can be downloaded using web-crawlers, and tools such as LIWC can make the data meaningfu

External link: http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-269189
