Use of Word Embeddings in a Literature-Based Discovery System

Author: Reed, Toby
Language: English
Year of publication: 2020
Description: Since Don R. Swanson’s first works in the field of Literature-Based Discovery (LBD) in the 1980s, there has been keen interest in the ability of the process to retrieve new relationships from already published articles. In addition, the volume of biomedical literature added to the public domain daily makes these automated systems more vital as time goes on, because researchers struggle to keep up to date with their own specialism, let alone any potentially related fields. Furthermore, the emergence of LBD and the explosion of published knowledge have come at a time when the pharmaceutical industry is beginning to understand the importance of repurposing existing compounds as a way of reducing costs whilst still managing new and old conditions. This thesis proposes an LBD system built on the Word2Vec group of models. The experiments use seven different corpora composed of biomedical articles of varying scope, from Raynaud Disease to Hematological journals, published on the MEDLINE database and retrieved through PubMed. The data were fed through a specially developed pre-processing pipeline to normalise them and then passed to a Word2Vec model; the cosine similarity metric was then used to retrieve the phrases most semantically similar to any phrase containing the word "Raynaud". Finally, these phrases were filtered by their UMLS semantic type and compared to the terms found by both Weeber and Swanson to evaluate the usefulness of the method. The experiments found that, when using these corpora, the majority of links can still be formed: an average of 88% of B-Terms for Open Discovery and an average of 81% for Closed Discovery. However, a large degree of manual curation is still necessary because of the imprecision of the process. This thesis shows that developing and implementing such a system, with improvements to its precision, can be of use to the research community.
Database: OpenAIRE
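
The ranking and filtering steps described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the vectors are made-up 3-dimensional toy values rather than trained Word2Vec embeddings, the term names and semantic type labels (`term_a`, `type_x`, and so on) are hypothetical placeholders rather than real UMLS types, and the function `rank_candidates` is an illustrative name. It assumes that each candidate is scored by its best cosine similarity to any phrase containing "raynaud".

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def rank_candidates(query_terms, embeddings, semantic_types, allowed_types, top_n=10):
    """Rank non-query terms by their best similarity to any query term,
    keeping only terms whose semantic type is in allowed_types."""
    scored = []
    for term, vec in embeddings.items():
        if term in query_terms:
            continue
        if semantic_types.get(term) not in allowed_types:
            continue
        best = max(cosine_similarity(embeddings[q], vec) for q in query_terms)
        scored.append((term, best))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy, made-up vectors standing in for trained embeddings (assumption).
embeddings = {
    "raynaud_disease": [1.0, 0.0, 0.0],
    "term_a": [0.9, 0.1, 0.0],
    "term_b": [0.0, 1.0, 0.0],
    "term_c": [0.8, 0.0, 0.6],
}
# Hypothetical semantic type labels, not real UMLS identifiers.
semantic_types = {"term_a": "type_x", "term_b": "type_x", "term_c": "type_y"}

# Query phrases are all vocabulary entries containing the word "raynaud".
query_terms = [t for t in embeddings if "raynaud" in t]
ranked = rank_candidates(query_terms, embeddings, semantic_types, {"type_x"})

for term, score in ranked:
    print(f"{term}\t{score:.4f}")
```

In this toy setup `term_c` is dropped by the semantic type filter even though it is fairly close to the query vector, which mirrors the role the abstract gives to UMLS-based filtering. The manual curation the abstract mentions would follow this step.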