Random texts do not exhibit the real Zipf's law-like rank distribution
Author: | Ramon Ferrer-i-Cancho, Brita Elvevåg |
---|---|
Contributors: | Universitat Politècnica de Catalunya. Departament de Ciències de la Computació, Universitat Politècnica de Catalunya. GPLN - Grup de Processament del Llenguatge Natural |
Language: | English |
Year of publication: | 2010 |
Subject: |
lcsh:Medicine; Computational linguistics; Computer Science/Numerical Analysis and Theoretical Computing; computer.software_genre; Physics/Interdisciplinary Physics; Humans; lcsh:Science; Statistical hypothesis testing; Language; Probability; Physics; Multidisciplinary; Models, Statistical; Zipf's law; business.industry; Computers; Rank (computer programming); lcsh:R; Models, Theoretical; Simple random sample; Semantics; Neuroscience/Psychology; Character (mathematics); Probability distribution; lcsh:Q; Lingüística computacional; Artificial intelligence; Mathematics/Statistics; business; Informàtica::Intel·ligència artificial::Llenguatge natural [Àrees temàtiques de la UPC]; computer; Natural language; Word (computer architecture); Natural language processing; Algorithms; Software; Research Article |
Source: | PLoS ONE, Vol 5, Iss 3, p e9411 (2010); UPCommons. Portal del coneixement obert de la UPC, Universitat Politècnica de Catalunya (UPC); Recercat. Dipósit de la Recerca de Catalunya |
Description: | Background: Zipf's law states that the relationship between the frequency of a word in a text and its rank (the most frequent word has rank 1, the 2nd most frequent word has rank 2, …) is approximately linear when plotted on a double logarithmic scale. It has been argued that the law is not a relevant or useful property of language because simple random texts - constructed by concatenating random characters including blanks behaving as word delimiters - exhibit a Zipf's law-like word rank distribution. Methodology/Principal Findings: In this article, we examine the flaws of such putative good fits of random texts. We demonstrate - by means of three different statistical tests - that ranks derived from random texts and ranks derived from real texts are statistically inconsistent with the parameters employed to argue for such a good fit, even when the parameters are inferred from the target real text. Our findings are valid for both the simplest random texts composed of equally likely characters as well as more elaborate and realistic versions where character probabilities are borrowed from a real text. Conclusions/Significance: The good fit of random texts to real Zipf's law-like rank distributions has not yet been established. Therefore, we suggest that Zipf's law might in fact be a fundamental law in natural languages. |
Database: | OpenAIRE |
External link: |
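
The simplest random text described in the abstract (equally likely characters, with the blank acting as a word delimiter) and its rank distribution can be sketched as follows. This is an illustrative sketch only, not the authors' code: the function names `random_text` and `rank_frequency`, the five-letter alphabet, the text length, and the seed are all assumptions chosen for the example, and the three statistical tests mentioned in the abstract are not implemented here.

```python
import random
from collections import Counter


def random_text(n_chars, alphabet="abcde", seed=0):
    """Concatenate n_chars symbols drawn uniformly from alphabet plus a blank.

    The alphabet and seed are hypothetical choices for this sketch.
    """
    rng = random.Random(seed)
    symbols = alphabet + " "  # the blank behaves as a word delimiter
    return "".join(rng.choice(symbols) for _ in range(n_chars))


def rank_frequency(text):
    """Return (rank, frequency) pairs, with rank 1 for the most frequent word."""
    counts = Counter(text.split())  # runs of blanks produce no empty words
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


text = random_text(100_000)
ranks = rank_frequency(text)

# Show the head of the rank distribution; plotting log(frequency) against
# log(rank) would give the double logarithmic view the abstract refers to.
for rank, freq in ranks[:5]:
    print(rank, freq)
```

Applying `rank_frequency` to a real text gives the ranks that the abstract compares against those of the random text.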