Extraction of Indonesian and English parallel sentences from movie subtitles

Author: Xuancong Wang, Boon Hong Yeo, Ai Ti Aw
Year of publication: 2017
Subject:
Source: IALP
DOI: 10.1109/ialp.2017.8300602
Description: A parallel corpus is an essential resource for developing a machine-learning-based statistical translation engine. The size and coverage of the parallel corpus available for training directly affect the translation accuracy of the engine. To make more training data available for developing a translation engine in the conversational domain, we propose a method to extract parallel data from movie subtitles using dynamic time warping, cosine similarity, and a beam search algorithm. The proposed method is capable of extracting 30% parallel sentences from a set of Indonesian-English movie subtitles with a precision of 98%.
Database: OpenAIRE
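
Illustration: the description names dynamic time warping and cosine similarity but does not say how they are combined, so the following is only a minimal sketch of the two generic techniques, not the authors' implementation. It assumes that subtitle start times (in seconds) are the feature aligned by dynamic time warping, and that both sentences have already been mapped into a shared token vocabulary (for example with a bilingual dictionary) before cosine similarity is computed. The function names `dtw_path` and `cosine` are hypothetical, and the beam search step is not shown.

```python
from collections import Counter
from math import sqrt


def dtw_path(a, b):
    """Align two sequences of start times with classic dynamic time warping.

    Returns a list of (index_in_a, index_in_b) pairs. Using start times as
    the alignment feature is an assumption made for this sketch.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return path


def cosine(u, v):
    """Cosine similarity between two token lists treated as bags of words."""
    cu, cv = Counter(u), Counter(v)
    dot = sum(cu[w] * cv[w] for w in cu)
    norm = sqrt(sum(c * c for c in cu.values())) * sqrt(sum(c * c for c in cv.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Hypothetical start times for three subtitle lines in each language.
    indonesian_times = [1.0, 5.0, 9.0]
    english_times = [1.2, 5.1, 9.3]
    print(dtw_path(indonesian_times, english_times))
```

In a pipeline of this kind, candidate pairs produced by the time alignment could then be kept only when their similarity score exceeds a threshold; the threshold value would have to be tuned on held-out data.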