Improving lexical coverage of text simplification systems for Spanish
Authors: | Sanja Štajner, Simone Paolo Ponzetto, Horacio Saggion |
---|---|
Year of publication: | 2019 |
Subject: | 0209 industrial biotechnology; Phrase; Lexical simplification; Machine translation; Synonym; Text simplification; business.industry; Computer science; General Engineering; Contrast (statistics); 02 engineering and technology; computer.software_genre; Computer Science Applications; 020901 industrial engineering & automation; Artificial Intelligence; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Grammaticality; Language model; Artificial intelligence; business; computer; Natural language processing |
Source: | Expert Systems with Applications. 118:80-91 |
ISSN: | 0957-4174 |
DOI: | 10.1016/j.eswa.2018.08.034 |
Description: | The current bottleneck of all data-driven lexical simplification (LS) systems is the scarcity and small size of the parallel corpora (original sentences and their manually simplified versions) used for training. This is especially pronounced for languages other than English. We address this problem, taking Spanish as an example of such a language, by building new simplification-specific datasets of synonyms and paraphrases using freely available resources. We test their usefulness in the LS task by adding them, in various combinations, to the existing text simplification (TS) training dataset in a phrase-based statistical machine translation (PBSMT) approach. Our best systems significantly outperform the state-of-the-art LS systems for Spanish in terms of the number of transformations performed and the grammaticality, simplicity, and meaning preservation of the output sentences. The results of a detailed manual analysis show that some of the newly built TS resources, although they have good lexical coverage and lead to a high number of transformations, often change the original meaning and do not generate simpler output when used in this PBSMT setup. In contrast, good combinations of these additional resources with the TS training dataset, together with a good choice of language model, improve the lexical coverage and produce sentences that are grammatical, simpler than the original, and preserve the original meaning well. |
Database: | OpenAIRE |