Showing 1 - 10 of 54 results for query: '"Kocmi, Tom"'
Author:
Kocmi, Tom, Avramidis, Eleftherios, Bawden, Rachel, Bojar, Ondrej, Dvorkovich, Anton, Federmann, Christian, Fishel, Mark, Freitag, Markus, Gowda, Thamme, Grundkiewicz, Roman, Haddow, Barry, Karpinska, Marzena, Koehn, Philipp, Marie, Benjamin, Murray, Kenton, Nagata, Masaaki, Popel, Martin, Popovic, Maja, Shmatova, Mariya, Steingrímsson, Steinþór, Zouhar, Vilém
This is the preliminary ranking of WMT24 General MT systems based on automatic metrics. The official ranking will be a human evaluation, which is superior to the automatic ranking and supersedes it. The purpose of this report is not to interpret any …
External link:
http://arxiv.org/abs/2407.19884
Annually, research teams spend large amounts of money to evaluate the quality of machine translation systems (WMT, inter alia). This is expensive because it requires a lot of expert human labor. The recently adopted annotation protocol, Error Span Annotation …
External link:
http://arxiv.org/abs/2406.12419
Author:
Kocmi, Tom, Zouhar, Vilém, Avramidis, Eleftherios, Grundkiewicz, Roman, Karpinska, Marzena, Popović, Maja, Sachan, Mrinmaya, Shmatova, Mariya
High-quality Machine Translation (MT) evaluation relies heavily on human judgments. Comprehensive error classification methods, such as Multidimensional Quality Metrics (MQM), are expensive as they are time-consuming and can only be done by experts, …
External link:
http://arxiv.org/abs/2406.11580
Author:
Moghe, Nikita, Fazla, Arnisa, Amrhein, Chantal, Kocmi, Tom, Steedman, Mark, Birch, Alexandra, Sennrich, Rico, Guillou, Liane
Recent machine translation (MT) metrics calibrate their effectiveness by correlating with human judgement but without any insights about their behaviour across different error types. Challenge sets are used to probe specific dimensions of metric behaviour …
External link:
http://arxiv.org/abs/2401.16313
Ten years ago a single metric, BLEU, governed progress in machine translation research. For better or worse, there is no such consensus today, and consequently it is difficult for researchers to develop and retain the kinds of heuristic intuitions about …
External link:
http://arxiv.org/abs/2401.06760
Author:
Kocmi, Tom, Federmann, Christian
This paper introduces GEMBA-MQM, a GPT-based evaluation metric designed to detect translation quality errors, specifically for the quality estimation setting without the need for human reference translations. Based on the power of large language models …
External link:
http://arxiv.org/abs/2310.13988
Reference-based metrics that operate at the sentence level typically outperform quality estimation metrics, which have access only to the source and system output. This is unsurprising, since references resolve ambiguities that may be present in the …
External link:
http://arxiv.org/abs/2309.08832
Author:
Tang, Tianyi, Lu, Hongyuan, Jiang, Yuchen Eleanor, Huang, Haoyang, Zhang, Dongdong, Zhao, Wayne Xin, Kocmi, Tom, Wei, Furu
Most research about natural language generation (NLG) relies on evaluation benchmarks with limited references for a sample, which may result in poor correlations with human judgements. The underlying reason is that one semantic meaning can actually be …
External link:
http://arxiv.org/abs/2305.15067
Generative large language models (LLMs), e.g., ChatGPT, have demonstrated remarkable proficiency across several NLP tasks, such as machine translation and text summarization. Recent research (Kocmi and Federmann, 2023) has shown that utilizing LLMs for …
External link:
http://arxiv.org/abs/2303.13809
Author:
Kocmi, Tom, Federmann, Christian
We describe GEMBA, a GPT-based metric for assessment of translation quality, which works both with a reference translation and without. In our evaluation, we focus on zero-shot prompting, comparing four prompt variants in two modes, based on the availability …
External link:
http://arxiv.org/abs/2302.14520