Mining Parallel Corpora from Sina Weibo and Twitter
Author: | Chris Dyer, Wang Ling, Isabel Trancoso, Alan W. Black, Luís Marujo |
---|---|
Year of publication: | 2016 |
Subject: |
Linguistics and Language
Microblogging; Computer science; Engineering and technology; Language and Linguistics; Parallel corpora; Resource (project management); Artificial Intelligence; Information systems; Electrical engineering, electronic engineering, information engineering; Social media; Quality (business); Training set; Contrast (statistics); Data resources; Computer Science Applications; Dynamic programming; Artificial intelligence & image processing; Natural language processing |
Source: | Computational Linguistics. 42:307-343 |
ISSN: | 1530-9312 0891-2017 |
DOI: | 10.1162/coli_a_00249 |
Description: | Microblogs such as Twitter, Facebook, and Sina Weibo (China's equivalent of Twitter) are a remarkable linguistic resource. In contrast to content from edited genres such as newswire, microblogs contain discussions of virtually every topic by numerous individuals in different languages and dialects and in different styles. In this work, we show that some microblog users post “self-translated” messages targeting audiences who speak different languages, either by writing the same message in multiple languages or by retweeting translations of their original posts in a second language. We introduce a method for finding and extracting this naturally occurring parallel data. Identifying the parallel content requires solving an alignment problem, and we give an optimally efficient dynamic programming algorithm for this. Using our method, we extract nearly 3M Chinese–English parallel segments from Sina Weibo using a targeted crawl of Weibo users who post in multiple languages. Additionally, from a random sample of Twitter, we obtain substantial amounts of parallel data in multiple language pairs. Evaluation is performed by assessing the accuracy of our extraction approach relative to a manual annotation as well as in terms of utility as training data for a Chinese–English machine translation system. Relative to traditional parallel data resources, the automatically extracted parallel data yield substantial translation quality improvements in translating microblog text and modest improvements in translating edited news content. |
Database: | OpenAIRE |