Entropy estimation and entropy-based encoding of written Amharic language for efficient transmission in telecom networks

Authors: Tsegamlak Terefe, Dereje Hailemariam
Year of publication: 2017
Subject:
Source: AFRICON
DOI: 10.1109/afrcon.2017.8095488
Description: Ethio-Telecom, the sole telecom service provider in Ethiopia, is expanding its mobile network infrastructure to reach its target of over 100 million subscribers in the next few years. As a result of this expansion and the introduction of new services, a considerable amount of electronic information (e.g., e-mail and Short Message Service (SMS) messages) written in local languages is expected to be generated. Amharic, the official language of Ethiopia, spoken by over thirty million primary and secondary speakers, will account for most of the generated traffic. In current practice, the Unicode Transformation Format (UTF-8) encoding scheme is used to represent each Amharic symbol with 16 bits. This is costly for individual users and for advertisers that send bulk SMSs. As an alternative, the Latin alphabet, which uses 7 or 8 bits/symbol, is used to exchange information in a cost-effective manner. Although writing Amharic texts in a foreign alphabet is cost-effective, it creates confusion, because standard translations are unavailable, and discomfort, because the English literacy level in the population is low. To represent written Amharic efficiently, this work investigates the entropy of the language and proposes entropy-based source encoding techniques. The entropy is estimated with Shannon's N-gram conditional and block estimation mechanisms, while Huffman and arithmetic coding are used for the encoding. The results show that the entropy is 7.981 bits/symbol for N = 1 and falls to 1.074 bits/symbol for N = 15. Moreover, arithmetic coding provides compression of up to 72.45%. To the best of our knowledge, this is the first work to compute the entropy of the language.
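The estimation and encoding steps named in the description can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the plug-in (relative-frequency) estimator, and the toy sample string are all assumptions made here for illustration. The sketch computes the block entropy of N-symbol blocks, Shannon's N-gram conditional estimate F_N = H(N-blocks) - H((N-1)-blocks), and the code lengths of a Huffman code built from single-symbol counts.

```python
import heapq
import math
from collections import Counter


def block_entropy(text: str, n: int) -> float:
    """Plug-in entropy, in bits, of the empirical distribution of n-symbol blocks."""
    if n == 0:
        return 0.0
    blocks = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(blocks.values())
    return -sum(c / total * math.log2(c / total) for c in blocks.values())


def conditional_entropy(text: str, n: int) -> float:
    """Shannon's N-gram estimate F_N = H(N-blocks) - H((N-1)-blocks), in bits/symbol."""
    return block_entropy(text, n) - block_entropy(text, n - 1)


def huffman_code_lengths(text: str) -> dict[str, int]:
    """Code length, in bits, of each symbol under a Huffman code on unigram counts."""
    counts = Counter(text)
    if len(counts) == 1:
        # A one-symbol alphabet still needs one bit per symbol.
        return {sym: 1 for sym in counts}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth in the tree}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def average_code_length(text: str, lengths: dict[str, int]) -> float:
    """Average number of coded bits per symbol of `text`."""
    counts = Counter(text)
    return sum(counts[s] * lengths[s] for s in counts) / len(text)


if __name__ == "__main__":
    sample = "ሰላም ለዓለም ሰላም ለሁሉም"  # hypothetical toy text, not the paper's corpus
    lengths = huffman_code_lengths(sample)
    avg = average_code_length(sample, lengths)
    print("F_1 (bits/symbol):", conditional_entropy(sample, 1))
    print("Huffman average length (bits/symbol):", avg)
    print("Saving versus 16 bits/symbol:", 1 - avg / 16)
```

On a string this short the plug-in estimates are heavily biased, and F_N can even turn negative for N > 1, so figures comparable to those reported in the description would require a large Amharic corpus. The average Huffman length always lies between the single-symbol entropy and that entropy plus one bit.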
Database: OpenAIRE