Description: |
We carried out a study exploring the feasibility of machine translation of tweets for the English-German language pair. As a first step, we created a small bilingual corpus of 1,000 tweets and used it to analyze the linguistic features of tweets. We tested different domain adaptation strategies and found that they improved translation performance. In our experiments, we found large differences in performance depending on how unknown words were handled; using XML markup, we were able to reduce this difference. We also replaced special Twitter expressions with placeholders, which enabled us to learn more robust n-gram statistics from Twitter data. We carried out a small-scale human evaluation to balance our automatic scores. Finally, we tested strategies for enforcing translation output of legal length. Generating n-best lists of translation candidates and searching them for legal tweets was found to be helpful, but ultimately too unreliable because there was no systematic way to determine the required value of n. We suggested a feature function based on character count as a potential solution. |