Search results

Report

On the Proper Treatment of Tokenization in Psycholinguistics

Authors: Giulianelli, Mario; Malagutti, Luca; Gastaldi, Juan Luis; DuSell, Brian; Vieira, Tim; Cotterell, Ryan

Language models are widely used in computational psycholinguistics to test theories that relate the negative log probability (the surprisal) of a region of interest (a substring of characters) under a language model to its cognitive cost experienced

External link: http://arxiv.org/abs/2410.02691

Report

The Foundations of Tokenization: Statistical and Computational Concerns

Authors: Gastaldi, Juan Luis; Terilla, John; Malagutti, Luca; DuSell, Brian; Vieira, Tim; Cotterell, Ryan

Tokenization, the practice of converting strings of characters over an alphabet into sequences of tokens over a vocabulary, is a critical yet under-theorized step in the NLP pipeline. Notably, it remains the only major step not fully integrated int

External link: http://arxiv.org/abs/2407.11606

Report

Variational Best-of-N Alignment

Authors: Amini, Afra; Vieira, Tim; Cotterell, Ryan

Best-of-N (BoN) is a popular and effective algorithm for aligning language models to human preferences. The algorithm works as follows: at inference time, N samples are drawn from the language model, and the sample with the highest reward, as judged

External link: http://arxiv.org/abs/2407.06057

Report

Direct Preference Optimization with an Offset

Authors: Amini, Afra; Vieira, Tim; Cotterell, Ryan

Direct preference optimization (DPO) is a successful fine-tuning strategy for aligning large language models with human preferences without the need to train a reward model or employ reinforcement learning. DPO, as originally formulated, relies on bi

External link: http://arxiv.org/abs/2402.10571

Report

Efficient Algorithms for Recognizing Weighted Tree-Adjoining Languages

Authors: Butoi, Alexandra; Vieira, Tim; Cotterell, Ryan; Chiang, David

The class of tree-adjoining languages can be characterized by various two-level formalisms, consisting of a context-free grammar (CFG) or pushdown automaton (PDA) controlling another CFG or PDA. These four formalisms are equivalent to tree-adjoining

External link: http://arxiv.org/abs/2310.15276

Report

Efficient Semiring-Weighted Earley Parsing

Authors: Opedal, Andreas; Zmigrod, Ran; Vieira, Tim; Cotterell, Ryan; Eisner, Jason

This paper provides a reference description, in the form of a deduction system, of Earley's (1970) context-free parsing algorithm with various speed-ups. Our presentation includes a known worst-case runtime improvement from Earley's $O(N^3|G||R|)$,

External link: http://arxiv.org/abs/2307.02982

Report

A Formal Perspective on Byte-Pair Encoding

Authors: Zouhar, Vilém; Meister, Clara; Gastaldi, Juan Luis; Du, Li; Vieira, Tim; Sachan, Mrinmaya; Cotterell, Ryan

Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method. BPE appears to be a greedy algorithm at face value, but the underlying optimization problem that BPE seeks to so

External link: http://arxiv.org/abs/2306.16837

Report

Algorithms for Acyclic Weighted Finite-State Automata with Failure Arcs

Authors: Svete, Anej; Dayan, Benjamin; Vieira, Tim; Cotterell, Ryan; Eisner, Jason

Weighted finite-state automata (WFSAs) are commonly used in NLP. Failure transitions are a useful extension for compactly representing backoffs or interpolation in $n$-gram models and CRFs, which are special cases of WFSAs. The pathsum in ordinary ac

External link: http://arxiv.org/abs/2301.06862

Report

Algorithms for Weighted Pushdown Automata

Authors: Butoi, Alexandra; DuSell, Brian; Vieira, Tim; Cotterell, Ryan; Chiang, David

Weighted pushdown automata (WPDAs) are at the core of many natural language processing tasks, like syntax-based statistical machine translation and transition-based dependency parsing. As most existing dynamic programming algorithms are designed for

External link: http://arxiv.org/abs/2210.06884

Report

On the Intersection of Context-Free and Regular Languages

Authors: Pasti, Clemente; Opedal, Andreas; Pimentel, Tiago; Vieira, Tim; Eisner, Jason; Cotterell, Ryan

The Bar-Hillel construction is a classic result in formal language theory. It shows, by a simple construction, that the intersection of a context-free language and a regular language is itself context-free. In the construction, the regular language i

External link: http://arxiv.org/abs/2209.06809
