Search results - "Minixhofer, P"

Report

Retrofitting (Large) Language Models with Dynamic Tokenization

Authors: Feher, Darius; Minixhofer, Benjamin; Vulić, Ivan

Current language models (LMs) use a fixed, static subword tokenizer. This choice, often taken for granted, typically results in degraded efficiency and capabilities in languages other than English, and makes it challenging to apply LMs to new domains

External link: http://arxiv.org/abs/2411.18553

Report

Beyond Oversmoothing: Evaluating DDPM and MSE for Scalable Speech Synthesis in ASR

Authors: Minixhofer, Christoph; Klejch, Ondřej; Bell, Peter

Synthetically generated speech has rapidly approached human levels of naturalness. However, the paradox remains that ASR systems, when trained on TTS output that is judged as natural by humans, continue to perform badly on real speech. In this work,

External link: http://arxiv.org/abs/2410.12279

Report

TTSDS -- Text-to-Speech Distribution Score

Authors: Minixhofer, Christoph; Klejch, Ondřej; Bell, Peter

Many recently published Text-to-Speech (TTS) systems produce audio close to real speech. However, TTS evaluation needs to be revisited to make sense of the results obtained with the new architectures, approaches and datasets. We propose evaluating th

External link: http://arxiv.org/abs/2407.12707

Report

Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation

Authors: Frohmann, Markus; Sterner, Igor; Vulić, Ivan; Minixhofer, Benjamin; Schedl, Markus

Segmenting text into sentences plays an early and crucial role in many NLP systems. This is commonly achieved by using rule-based or statistical methods relying on lexical features such as punctuation. Although some recent works no longer exclusively

External link: http://arxiv.org/abs/2406.16678

Report

Zero-Shot Tokenizer Transfer

Authors: Minixhofer, Benjamin; Ponti, Edoardo Maria; Vulić, Ivan

Language models (LMs) are bound to their tokenizer, which maps raw text to a sequence of vocabulary items (tokens). This restricts their flexibility: for example, LMs trained primarily on English may still perform well in other natural and programmin

External link: http://arxiv.org/abs/2405.07883

Report

Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation

Authors: Minixhofer, Benjamin; Pfeiffer, Jonas; Vulić, Ivan

Many NLP pipelines split text into sentences as one of the crucial preprocessing steps. Prior sentence segmentation tools either rely on punctuation or require a considerable amount of sentence-segmented training data: both central assumptions might

External link: http://arxiv.org/abs/2305.18893

Report

CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models

Authors: Minixhofer, Benjamin; Pfeiffer, Jonas; Vulić, Ivan

While many languages possess processes of joining two or more words to create compound words, previous studies have been typically limited only to languages with excessively productive compound formation (e.g., German, Dutch) and there is no public d

External link: http://arxiv.org/abs/2305.14214

Report

Evaluating and reducing the distance between synthetic and real speech distributions

Authors: Minixhofer, Christoph; Klejch, Ondřej; Bell, Peter

While modern Text-to-Speech (TTS) systems can produce natural-sounding speech, they remain unable to reproduce the full diversity found in natural speech data. We consider the distribution of all possible real speech samples that could be generated b

External link: http://arxiv.org/abs/2211.16049

Report

HumSet: Dataset of Multilingual Information Extraction and Classification for Humanitarian Crisis Response

Authors: Fekih, Selim; Tamagnone, Nicolò; Minixhofer, Benjamin; Shrestha, Ranjan; Contla, Ximena; Oglethorpe, Ewan; Rekabsaz, Navid

Timely and effective response to humanitarian crises requires quick and accurate analysis of large amounts of text data - a process that can highly benefit from expert-assisted NLP systems trained on validated and annotated data in the humanitarian r

External link: http://arxiv.org/abs/2210.04573

Report

Mask-combine Decoding and Classification Approach for Punctuation Prediction with real-time Inference Constraints

Authors: Minixhofer, Christoph; Klejch, Ondřej; Bell, Peter

In this work, we unify several existing decoding strategies for punctuation prediction in one framework and introduce a novel strategy which utilises multiple predictions at each word across different windows. We show that significant improvements ca

External link: http://arxiv.org/abs/2112.08098
