Topic evolution in scientific publications over time: A data pipeline

Author:	Aaltonen, Kalle
Contributors:	Informaatioteknologian ja viestinnän tiedekunta - Faculty of Information Technology and Communication Sciences, Tampere University
Language:	English
Year of publication:	2020
Subject:	databases; LDA; topic evolution; Tietojenkäsittelyopin maisteriohjelma - Master's Programme in Computer Science; algorithms; topic modelling; Non-negative Matrix Factorization; NLP; DTM; machine learning; data; NMF; Latent Dirichlet Allocation; topic model; data pipeline; natural language
Description:	This study aims to identify an optimal data pipeline for modelling topic evolution over time in scientific publications of the Tampere universities. To define the pipeline, we divided it into four stages: data acquisition, preprocessing, persisting and topic modelling. We then compared alternative methods of executing each stage and composed the final pipeline from the best-performing methods. As the data set, we used the English-language abstracts of Master’s theses, sourced from Trepo, the repository for scientific papers of the Tampere universities. Our results show that the Dynamic Non-negative Matrix Factorization (DNMF) algorithm is significantly faster to train, and a more versatile implementation, than the Dynamic Topic Models (DTM) algorithm. The two algorithms produce very similar latent topics, in which technical fields of study are dominant. This seems to reflect the distribution of fields of study in our corpus. The evolution of individual terms inside topics follows real-world trends and technological advancements to some extent. The persisting-layer comparison shows PostgreSQL performing better than MongoDB on aggregate queries. Surprisingly, this was also true for queries targeting data stored as a JSON data type inside PostgreSQL. This finding is particularly interesting because MongoDB is a dedicated document store, whereas PostgreSQL is primarily a relational database management system. Data acquisition results show that the most efficient way to ingest data from Trepo is through the provided OAI-PMH service. Our research does not identify any reason to use web scraping instead. The thesis proposes a pipeline mainly from the efficiency perspective. The time-inefficiency of training the topic models needs to be taken into account when implementing a system based on the proposed data pipeline. Additionally, the study highlights the possibility of using PostgreSQL as a dedicated document store.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=od______4853::7a63e49dcc0f1f8f0419b9cae8cd0a7d https://trepo.tuni.fi/handle/10024/123619
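
The abstract names the repository's OAI-PMH service as the most efficient way to ingest data from Trepo. As a rough illustration of what such a harvesting step can involve, here is a minimal Python sketch using only the standard library. It is not the thesis's implementation: the function names, the `example.org` endpoint and the sample response are hypothetical, and the actual Trepo OAI-PMH base URL would need to be looked up. The sketch assumes records are requested in the standard `oai_dc` (Dublin Core) format and that paging follows the protocol's `resumptionToken` mechanism.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, resumption_token=None, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL (hypothetical helper).

    OAI-PMH treats resumptionToken as an exclusive argument, so
    metadataPrefix is only sent on the first request.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Extract Dublin Core fields and the next resumption token, if any."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            "titles": [e.text or "" for e in rec.iterfind(".//dc:title", NS)],
            "descriptions": [
                e.text or "" for e in rec.iterfind(".//dc:description", NS)],
            "dates": [e.text or "" for e in rec.iterfind(".//dc:date", NS)],
        })
    token = root.findtext(
        ".//oai:resumptionToken", default="", namespaces=NS).strip()
    return records, (token or None)


# Made-up response fragment for illustration; not real Trepo data.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example thesis</dc:title>
          <dc:description>Example abstract text.</dc:description>
          <dc:date>2020</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    records, token = parse_list_records(SAMPLE_RESPONSE)
    print(records)
    # URL for the next page, built from the returned token.
    print(list_records_url("https://example.org/oai", token))
```

In a real harvest, the URL would be fetched (for example with `urllib.request.urlopen`), and the loop would continue until no resumption token is returned.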