Adaptation of machine translation for multilingual information retrieval in the medical domain

Authors:	David Mareček, Ondřej Dušek, Zdeňka Urešová, Aleš Tamchyna, Johannes Leveling, Pavel Pecina, Rudolf Rosa, Jan Hajič, Martin Popel, Lorraine Goeuriot, Michal Novák, Jaroslava Hlaváčová, Liadh Kelly, Gareth J. F. Jones
Contributors:	Charles University [Prague] (CU), Modélisation et Recherche d’Information Multimédia [Grenoble] (MRIM), Laboratoire d'Informatique de Grenoble (LIG), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP)-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF), Dublin City University [Dublin] (DCU)
Language:	English
Publication year:	2014
Subjects:	Phrase; Machine translation; Computer science; Information Storage and Retrieval; Medicine (miscellaneous); 02 engineering and technology; computer.software_genre; Query expansion; Rule-based machine translation; Artificial Intelligence; 0202 electrical engineering, electronic engineering, information engineering; Information retrieval; Evaluation of machine translation; Cross-language information retrieval; ComputingMilieux_MISCELLANEOUS; Language; Natural Language Processing; Web search query; business.industry; 05 social sciences; Translating; Unified Medical Language System; Clef; [INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR]; 020201 artificial intelligence & image processing; Artificial intelligence; 0509 other social sciences; 050904 information & library sciences; business; computer; Machine translating; Algorithms; Software; Natural language processing
Source:	Artificial Intelligence in Medicine, Elsevier, 2014, 61 (3), pp. 165-185. ⟨10.1016/j.artmed.2014.01.004⟩
ISSN:	0933-3657
DOI:	10.1016/j.artmed.2014.01.004
Description:	Objective: We investigate machine translation (MT) of user search queries in the context of cross-lingual information retrieval (IR) in the medical domain. The main focus is on techniques to adapt MT to increase translation quality; however, we also explore MT adaptation to improve the effectiveness of cross-lingual IR. Methods and data: Our MT system is Moses, a state-of-the-art phrase-based statistical machine translation system. The IR system is based on the BM25 retrieval model implemented in the Lucene search engine. The MT techniques employed in this work include in-domain training and tuning, intelligent training data selection, optimization of phrase table configuration, compound splitting, and exploiting synonyms as translation variants. The IR methods include morphological normalization and using multiple translation variants for query expansion. The experiments are performed and thoroughly evaluated on three language pairs: Czech-English, German-English, and French-English. MT quality is evaluated on data sets created within the Khresmoi project and IR effectiveness is tested on the CLEF eHealth 2013 data sets. Results: The search query translation results achieved in our experiments are outstanding: our systems outperform not only our strong baselines, but also Google Translate and Microsoft Bing Translator in a direct comparison carried out on all the language pairs. The baseline BLEU scores increased from 26.59 to 41.45 for Czech-English, from 23.03 to 40.82 for German-English, and from 32.67 to 40.82 for French-English. This is a 55% improvement on average. In terms of IR performance on this particular test collection, a significant improvement over the baseline is achieved only for French-English. For Czech-English and German-English, the increased MT quality does not lead to better IR results. Conclusions: Most of the MT techniques employed in our experiments improve MT of medical search queries. Intelligent training data selection in particular proves to be very successful for domain adaptation of MT. Certain improvements are also obtained from German compound splitting on the source language side. Translation quality, however, does not appear to correlate with IR performance: better translation does not necessarily yield better retrieval. We discuss in detail the contribution of the individual techniques and state-of-the-art features and provide future research directions.
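Illustration (not part of the original record): the abstract names BM25 retrieval and query expansion with multiple translation variants but gives no implementation details. The following is a minimal Python sketch of how the two might fit together, under stated assumptions. The function names `bm25_scores` and `expand_query`, the toy documents, and the parameter values k1=1.2 and b=0.75 are all hypothetical choices made here for illustration; the settings the authors actually used in Lucene are not stated in the abstract.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    k1 and b are assumed values, not settings taken from the paper.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # idf variant with "1 +" inside the log, so it never goes negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def expand_query(translation_variants):
    """Merge the tokens of several translation variants, keeping first-seen order."""
    merged = []
    for variant in translation_variants:
        for token in variant.lower().split():
            if token not in merged:
                merged.append(token)
    return merged


# Toy collection and two hypothetical English translations of one source query.
docs = [
    ["heart", "attack", "symptoms"],
    ["myocardial", "infarction", "treatment"],
    ["knee", "pain", "exercise"],
]
variants = ["heart attack", "myocardial infarction"]

single = bm25_scores(expand_query(variants[:1]), docs)
expanded = bm25_scores(expand_query(variants), docs)
```

In this toy setup the second document scores zero when only the first translation is used and receives a positive score once the second variant is merged into the query. This shows how expansion can widen recall; it says nothing about the retrieval results reported in the paper.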
Database:	OpenAIRE
External links:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::3888475deebcc1659f682a3814c85cbe https://hal.archives-ouvertes.fr/hal-01921881