Information content versus word length in random typing

Authors: Ferrer-i-Cancho, Ramon; Martín, Fermín Moscoso del Prado
Publication year: 2012
Subject:
Source: Journal of Statistical Mechanics, L12002 (2011)
Document type: Working Paper
DOI: 10.1088/1742-5468/2011/12/L12002
Description: Recently, it has been claimed that a linear relationship between a measure of information content and word length is expected from word length optimization, and it has been shown that this linearity is supported by a strong correlation between information content and word length in many languages (Piantadosi et al. 2011, PNAS 108, 3825-3826). Here, we study in detail some connections between this measure and standard information theory. We examine the relationship between the measure and word length for the popular random typing process, in which a text is constructed by pressing keys at random on a keyboard containing letters and a space that acts as a word delimiter. Although this random process does not optimize word lengths according to information content, it exhibits a linear relationship between information content and word length. The exact slope and intercept are presented for three major variants of the random typing process. A strong correlation between information content and word length can arise simply from the units making up a word (e.g., letters) and not necessarily from the interplay between a word and its context, as proposed by Piantadosi et al. In itself, the linear relation does not entail the results of any optimization process.
Database: arXiv
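
The random typing process described in the abstract can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: all letters are equally likely, empty words (two consecutive spaces) are allowed, and the information content of a word is taken to be -log2 of its probability, which is reasonable here only because randomly typed words are independent of their context. The function names and parameters (`n_letters`, `p_space`) are hypothetical. This is one simple variant, not a reproduction of the three variants or the exact slopes and intercepts reported by the authors.

```python
import math
import random
from collections import Counter


def theoretical_information(length, n_letters, p_space):
    """-log2 of the probability of one specific word of the given length.

    Assumes equally likely letters and that empty words are allowed, so a
    word of length l has probability ((1 - p_space) / n_letters)**l * p_space.
    """
    p_letter = (1.0 - p_space) / n_letters
    return -length * math.log2(p_letter) - math.log2(p_space)


def simulate_words(n_keys, n_letters, p_space, seed=0):
    """Press n_keys random keys and return the words delimited by spaces."""
    rng = random.Random(seed)
    alphabet = [chr(ord("a") + i) for i in range(n_letters)]
    words, current = [], []
    for _ in range(n_keys):
        if rng.random() < p_space:
            words.append("".join(current))  # may be the empty word
            current = []
        else:
            current.append(rng.choice(alphabet))
    return words  # a trailing unfinished word is discarded


def empirical_information_by_length(words):
    """Average -log2(relative frequency) over the distinct words of each length."""
    counts = Counter(words)
    total = sum(counts.values())
    by_length = {}
    for word, count in counts.items():
        by_length.setdefault(len(word), []).append(-math.log2(count / total))
    return {l: sum(v) / len(v) for l, v in sorted(by_length.items())}


if __name__ == "__main__":
    words = simulate_words(n_keys=200_000, n_letters=2, p_space=0.5, seed=1)
    estimates = empirical_information_by_length(words)
    for length in range(4):
        print(length, round(estimates[length], 2),
              theoretical_information(length, n_letters=2, p_space=0.5))
```

Under these assumptions the information content is exactly linear in word length, with slope log2(n_letters / (1 - p_space)) and intercept -log2(p_space); for two letters and p_space = 0.5 that is 2 bits per letter plus 1 bit. No optimization of word lengths is involved, which is the point the abstract makes.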