Extraction of Indonesian and English parallel sentences from movie subtitles
Author: | Xuancong Wang, Boon Hong Yeo, Ai Ti Aw |
---|---|
Year of publication: | 2017 |
Subject: | Dynamic time warping; Training set; Computer science; Cosine similarity; Translation (geometry); Domain (software engineering); Set (abstract data type); Indonesian; Beam search; Artificial intelligence; Natural language processing; 02 engineering and technology; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; 03 medical and health sciences; 0305 other medical science; 030507 speech-language pathology & audiology |
Source: | IALP |
DOI: | 10.1109/ialp.2017.8300602 |
Description: | A parallel corpus is a mandatory resource for developing a machine-learning-based statistical translation engine. The size and coverage of the parallel corpus available for training directly affect the translation accuracy of the engine. To make more training data available for developing a translation engine in the conversational domain, we propose a method to extract parallel data from movie subtitles using dynamic time warping, cosine similarity, and a beam search algorithm. The proposed method is capable of extracting 30% parallel sentences from a set of Indonesian-English movie subtitles with a precision of 98%. |
Database: | OpenAIRE |
External link: |
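
The description names dynamic time warping and cosine similarity as building blocks but gives no implementation detail. The following is a minimal illustrative sketch of how those two pieces could be combined to align subtitle lines. It is not the authors' method. It assumes each subtitle line is already tokenized and that both sides have been mapped into a shared vocabulary (for example through a bilingual dictionary), and it leaves out the beam search step entirely. The function names `cosine_similarity`, `dtw_align`, and `extract_pairs`, and the `threshold` parameter, are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two token lists, using bag-of-words counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values()) * sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def dtw_align(src, tgt):
    """Align two sequences of token lists with dynamic time warping.

    The local cost is 1 - cosine similarity. Returns the warping path
    as a list of (source_index, target_index) pairs.
    """
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = 1.0 - cosine_similarity(src[i - 1], tgt[j - 1])
            cost[i][j] = local + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end of both sequences to recover the cheapest path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def extract_pairs(src, tgt, threshold=0.5):
    """Keep only aligned pairs whose similarity reaches the (assumed) threshold."""
    return [(i, j) for i, j in dtw_align(src, tgt)
            if cosine_similarity(src[i], tgt[j]) >= threshold]


# Toy data: the target side has one extra line that has no counterpart.
src = [["hello", "there"], ["how", "are", "you"], ["goodbye"]]
tgt = [["hello", "there"], ["extra", "line"], ["how", "are", "you"], ["goodbye"]]
pairs = extract_pairs(src, tgt)
```

In this toy case the warping path has to pass through the unmatched target line, and the similarity threshold then filters that pair out, which is one simple way to trade recall for precision.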