Improving lexical coverage of text simplification systems for Spanish

Authors:	Sanja Štajner, Simone Paolo Ponzetto, Horacio Saggion
Publication year:	2019
Subject:	Phrase; Lexical simplification; Machine translation; Synonym; Text simplification; Computer science; General Engineering; Contrast (statistics); Computer Science Applications; Artificial intelligence; Grammaticality; Language model; Natural language processing; 02 engineering and technology; 0209 industrial biotechnology; 020901 industrial engineering & automation; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing
Source:	Expert Systems with Applications. 118:80-91
ISSN:	0957-4174
DOI:	10.1016/j.eswa.2018.08.034
Description:	The current bottleneck of all data-driven lexical simplification (LS) systems is the scarcity and small size of the parallel corpora (original sentences and their manually simplified versions) used for training. This is especially pronounced for languages other than English. We address this problem, taking Spanish as an example of such a language, by building new simplification-specific datasets of synonyms and paraphrases from freely available resources. We test their usefulness in the LS task by adding them, in various combinations, to the existing text simplification (TS) training dataset in a phrase-based statistical machine translation (PBSMT) approach. Our best systems significantly outperform the state-of-the-art LS systems for Spanish in terms of the number of transformations performed and the grammaticality, simplicity, and meaning preservation of the output sentences. A detailed manual analysis shows that some of the newly built TS resources, although they have good lexical coverage and lead to a high number of transformations, often change the original meaning and do not generate simpler output when used in this PBSMT setup. In contrast, good combinations of these additional resources with the TS training dataset, together with a good choice of language model, improve the lexical coverage and produce sentences that are grammatical, simpler than the original, and preserve the original meaning well.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::fd324eeb7ee6010e6b8386a566d99290 https://doi.org/10.1016/j.eswa.2018.08.034