A Comparison of Entity Matching Methods between English and Japanese Katakana

Authors:	Hidekazu Oiwa, Michiharu Yamashita, Hideki Awashima
Publication year:	2018
Subject:	Kanji; Computer science; Katakana; Similarity measure; Similarity (network science); Transliteration; Onomatopoeia; Artificial intelligence; String metric; Natural language processing; Japanese writing system
Source:	Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology.
Description:	Japanese Katakana is one component of the Japanese writing system and is used to express English terms, loanwords, and onomatopoeia in Japanese characters based on their phonemes. The main purpose of this research is to find the best entity matching methods between English and Katakana. We pose two research questions to clarify which types of entity matching systems work better than others. The first question is which transliteration should be used for conversion: English and Katakana terms must be transliterated into the same form before string similarity can be computed. We consider five conversions: English to Katakana directly, Katakana to English directly, English to Katakana via phonemes, Katakana to English via phonemes, and both English and Katakana to phonemes. The second question is which similarity measure should be used for entity matching. To investigate this, we choose six methods: Overlap Coefficient, Cosine, Jaccard, Jaro-Winkler, Levenshtein, and the similarity of the phoneme probabilities predicted by an RNN. Our results show that 1) matching using phonemes and conversion of Katakana to English works better than other methods, and 2) the similarity of phonemes outperforms the other measures, whose scores vary depending on the data and models.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::0fb56ddd620967b76a0026f6c46ee484 https://doi.org/10.18653/v1/w18-5809
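
Four of the string similarity measures named in the description (Overlap Coefficient, Jaccard, Cosine, and Levenshtein) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the use of character bigrams as tokens, the normalization of Levenshtein distance by the longer string's length, and all function names are assumptions made here. Jaro-Winkler and the RNN-based phoneme similarity are omitted.

```python
from collections import Counter
from math import sqrt


def bigrams(s):
    """Character bigrams of s (assumed tokenization; the record does not specify one)."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def overlap_coefficient(a, b):
    """|X ∩ Y| / min(|X|, |Y|) over bigram sets."""
    x, y = set(bigrams(a)), set(bigrams(b))
    if not x or not y:
        return 0.0
    return len(x & y) / min(len(x), len(y))


def jaccard(a, b):
    """|X ∩ Y| / |X ∪ Y| over bigram sets."""
    x, y = set(bigrams(a)), set(bigrams(b))
    if not x and not y:
        return 0.0
    return len(x & y) / len(x | y)


def cosine(a, b):
    """Cosine of the angle between bigram count vectors."""
    x, y = Counter(bigrams(a)), Counter(bigrams(b))
    dot = sum(x[g] * y[g] for g in x)
    norm = sqrt(sum(v * v for v in x.values())) * sqrt(sum(v * v for v in y.values()))
    return dot / norm if norm else 0.0


def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def levenshtein_similarity(a, b):
    """Distance rescaled to [0, 1]; this normalization is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / longest if longest else 1.0


if __name__ == "__main__":
    # Hypothetical pair: an English term and an illustrative romanization
    # of its Katakana form, both already converted to the same script.
    english, converted = "computer", "konpyuta"
    for measure in (overlap_coefficient, jaccard, cosine, levenshtein_similarity):
        print(measure.__name__, round(measure(english, converted), 3))
```

All four functions expect both inputs in the same form, which is why the description's first research question (which transliteration to apply) has to be settled before any of these scores can be computed.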