Text authorship identified using the dynamics of word co-occurrence networks
Author: | Diego R. Amancio, Osvaldo N. Oliveira, Camilo Akimushkin |
---|---|
Publication year: | 2016 |
Subject: |
FOS: Computer and information sciences
Computer science; Social Sciences; lcsh:Medicine; computer.software_genre; Bioinformatics; 01 natural sciences; SIGNIFICADO; 010305 fluids & plasmas; Mathematical and Statistical Techniques; Centrality; Psychology; lcsh:Science; Language; Computer Science - Computation and Language; Multidisciplinary; Applied Mathematics; Simulation and Modeling; Complex network; Semantics; Autocorrelation; Physical Sciences; Engineering and Technology; Scale-Free Networks; Computation and Language (cs.CL); Natural language processing; Network Analysis; Algorithms; Statistics (Mathematics); Research Article; Computer and Information Sciences; Phonology; Research and Analysis Methods; 0103 physical sciences; Humans; Syntax; Statistical Methods; 010306 general physics; business.industry; Supervised learning; Scale-free network; lcsh:R; Cognitive Psychology; Biology and Life Sciences; Linguistics; Probability Theory; Probability Distribution; Authorship; Evolving networks; Signal Processing; Cognitive Science; lcsh:Q; Artificial intelligence; Isomap; business; computer; Co-occurrence networks; Mathematics; Neuroscience |
Source: | PLoS ONE, Vol 12, Iss 1, p e0170527 (2017); Repositório Institucional da USP (Biblioteca Digital da Produção Intelectual); Universidade de São Paulo (USP); instacron:USP; PLoS ONE |
DOI: | 10.48550/arxiv.1608.01965 |
Description: | Automatic identification of authorship in disputed documents has benefited from complex network theory, as this approach requires neither human expertise nor detailed semantic knowledge. Networks modeling entire books can be used to discriminate texts from different sources and to understand network growth mechanisms, but only a few studies have probed the suitability of networks for modeling small chunks of text to grasp stylistic features. In this study, we introduce a methodology based on the dynamics of word co-occurrence networks representing written texts to classify a corpus of 80 texts by 8 authors. The texts were divided into sections with an equal number of linguistic tokens, from which time series were created for 12 topological metrics. Since 73% of all series were stationary (ARIMA(p, 0, q)) and the remaining series were integrable of first order (ARIMA(p, 1, q)), probability distributions could be obtained for the global network metrics. The metrics exhibit bell-shaped non-Gaussian distributions, and therefore distribution moments were used as learning attributes. With an optimized supervised learning procedure based on a nonlinear transformation performed by Isomap, 71 out of 80 texts were correctly classified using the K-nearest neighbors algorithm, i.e., an author-matching success rate of 88.75% was achieved. Hence, purely dynamic fluctuations in network metrics can characterize authorship, paving the way for a robust description of large texts in terms of small evolving networks. |
Database: | OpenAIRE |
External link: |
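
The feature-extraction steps named in the description (equal-length sections, one co-occurrence network per section, a time series per metric, distribution moments as attributes) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the record does not list the 12 topological metrics, so average degree and average clustering coefficient stand in for them, and linking adjacent words, the section size, and all function names are assumptions made for this example.

```python
from collections import defaultdict
from statistics import fmean


def chunk_tokens(tokens, size):
    """Split tokens into consecutive sections of exactly `size` tokens.

    A trailing remainder shorter than `size` is dropped (an assumption).
    """
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


def cooccurrence_network(tokens):
    """Undirected network linking each pair of adjacent, distinct words."""
    adj = defaultdict(set)
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def average_degree(adj):
    """Mean number of neighbours per node (illustrative metric)."""
    return fmean(len(nbrs) for nbrs in adj.values()) if adj else 0.0


def average_clustering(adj):
    """Mean local clustering coefficient; nodes with degree < 2 count as 0."""
    coeffs = []
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        nb = list(nbrs)
        links = sum(
            1 for i in range(k) for j in range(i + 1, k) if nb[j] in adj[nb[i]]
        )
        coeffs.append(2.0 * links / (k * (k - 1)))
    return fmean(coeffs) if coeffs else 0.0


def moments(series):
    """Return (mean, population std, skewness, excess kurtosis) of a series."""
    mu = fmean(series)
    centered = [x - mu for x in series]
    m2 = fmean(c ** 2 for c in centered)
    if m2 == 0:
        return (mu, 0.0, 0.0, 0.0)  # constant series: higher moments undefined
    m3 = fmean(c ** 3 for c in centered)
    m4 = fmean(c ** 4 for c in centered)
    return (mu, m2 ** 0.5, m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0)


def text_features(tokens, size):
    """Map each metric name to the moments of its per-section time series."""
    networks = [cooccurrence_network(s) for s in chunk_tokens(tokens, size)]
    series = {
        "average_degree": [average_degree(n) for n in networks],
        "average_clustering": [average_clustering(n) for n in networks],
    }
    return {name: moments(values) for name, values in series.items()}
```

Calling `text_features(tokens, size)` on a tokenized text gives one tuple of moments per metric; concatenating those tuples would give a per-text attribute vector of the kind the description says was passed to Isomap and K-nearest neighbors.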