Text authorship identified using the dynamics of word co-occurrence networks

Authors: Diego R. Amancio, Osvaldo N. Oliveira, Camilo Akimushkin
Publication year: 2016
Subject:
FOS: Computer and information sciences
Computer science
Social Sciences
lcsh:Medicine
computer.software_genre
Bioinformatics
01 natural sciences
Meaning (SIGNIFICADO)
010305 fluids & plasmas
Mathematical and Statistical Techniques
Centrality
Psychology
lcsh:Science
Language
Computer Science - Computation and Language
Multidisciplinary
Applied Mathematics
Simulation and Modeling
Complex network
Semantics
Autocorrelation
Physical Sciences
Engineering and Technology
Scale-Free Networks
Computation and Language (cs.CL)
Natural language processing
Network Analysis
Algorithms
Statistics (Mathematics)
Research Article
Computer and Information Sciences
Phonology
Research and Analysis Methods
0103 physical sciences
Humans
Syntax
Statistical Methods
010306 general physics
business.industry
Supervised learning
Scale-free network
lcsh:R
Cognitive Psychology
Biology and Life Sciences
Linguistics
Probability Theory
Probability Distribution
Authorship
Evolving networks
Signal Processing
Cognitive Science
lcsh:Q
Artificial intelligence
Isomap
business
computer
Co-occurrence networks
Mathematics
Neuroscience
Source: PLoS ONE, Vol 12, Iss 1, p e0170527 (2017)
Repositório Institucional da USP (Biblioteca Digital da Produção Intelectual)
Universidade de São Paulo (USP)
instacron:USP
PLoS ONE
DOI: 10.48550/arxiv.1608.01965
Description: Automatic identification of authorship in disputed documents has benefited from complex network theory, as this approach requires neither human expertise nor detailed semantic knowledge. Networks modeling entire books can be used to discriminate texts from different sources and to understand network growth mechanisms, but only a few studies have probed the suitability of networks for modeling small chunks of text to capture stylistic features. In this study, we introduce a methodology based on the dynamics of word co-occurrence networks representing written texts to classify a corpus of 80 texts by 8 authors. The texts were divided into sections with an equal number of linguistic tokens, from which time series were created for 12 topological metrics. Since 73% of all series were stationary (ARIMA(p, 0, q)) and the remainder were integrated of first order (ARIMA(p, 1, q)), probability distributions could be obtained for the global network metrics. The metrics exhibit bell-shaped non-Gaussian distributions, and therefore distribution moments were used as learning attributes. With an optimized supervised learning procedure based on a nonlinear transformation performed by Isomap, 71 out of 80 texts were correctly classified using the K-nearest neighbors algorithm, i.e. an author matching success rate of 88.75% was achieved. Hence, purely dynamic fluctuations in network metrics can characterize authorship, paving the way for a robust description of large texts in terms of small evolving networks.
Database: OpenAIRE
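
Illustrative sketch: the feature-extraction idea summarized in the description (equal-sized token sections, one co-occurrence network per section, a time series per network metric, distribution moments as learning attributes) might look roughly like the Python below. This is not the authors' code. The function names, the whitespace tokenization, the default section size of 200 tokens, and the choice of only two metrics (mean degree and mean clustering coefficient) are assumptions made for illustration; the study itself uses 12 topological metrics. The ARIMA stationarity check and the Isomap plus K-nearest neighbors classification step are not shown.

```python
from collections import defaultdict
from statistics import fmean


def cooccurrence_network(tokens):
    """Undirected graph linking each pair of adjacent, distinct tokens."""
    adj = defaultdict(set)
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return dict(adj)


def mean_degree(adj):
    """Average number of neighbours per node (0.0 for an empty graph)."""
    return fmean(len(nbrs) for nbrs in adj.values()) if adj else 0.0


def mean_clustering(adj):
    """Average local clustering coefficient (0 for nodes of degree < 2)."""
    coeffs = []
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        nb = list(nbrs)
        # Count links between pairs of neighbours of this node.
        links = sum(
            1 for i in range(k) for j in range(i + 1, k) if nb[j] in adj[nb[i]]
        )
        coeffs.append(2.0 * links / (k * (k - 1)))
    return fmean(coeffs) if coeffs else 0.0


def metric_series(tokens, chunk_size, metric):
    """One metric value per consecutive, equal-sized section of tokens."""
    n_chunks = len(tokens) // chunk_size  # a trailing partial section is dropped
    return [
        metric(cooccurrence_network(tokens[i * chunk_size:(i + 1) * chunk_size]))
        for i in range(n_chunks)
    ]


def moments(series):
    """Mean, variance, skewness and kurtosis (population form) of a series."""
    if not series:
        raise ValueError("series is empty; text shorter than one section?")
    m = fmean(series)

    def central(p):
        return fmean((x - m) ** p for x in series)

    var = central(2)
    if var == 0:
        return (m, 0.0, 0.0, 0.0)
    return (m, var, central(3) / var ** 1.5, central(4) / var ** 2)


def text_features(text, chunk_size=200):
    """Concatenate the four moments of each metric's series for one text."""
    tokens = text.lower().split()  # assumption: naive whitespace tokenization
    feats = []
    for metric in (mean_degree, mean_clustering):
        feats.extend(moments(metric_series(tokens, chunk_size, metric)))
    return feats
```

Under these assumptions, each text yields one fixed-length feature vector (four moments per metric), which could then be passed to a dimensionality-reduction and nearest-neighbor classifier of the kind the description mentions.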