The Case for Retaining Natural Language Descriptions of Phenotypes in Plant Databases and a Web Application as Proof of Concept

Authors:	Diane C. Bassham, Carolyn J. Lawrence-Dill, Ian R. Braun
Year of publication:	2021
Subject:	Text mining; Information retrieval; Data curation; business.industry; Computer science; Proof of concept; Similarity (psychology); Code (cryptography); Web application; business; Variety (linguistics); Natural language; Domain (software engineering)
DOI:	10.1101/2021.02.04.429796
Description:	Motivation: Finding similarity across phenotypic descriptions is not straightforward, with previous successes in computation requiring significant expert data curation. Natural language processing of free text phenotype descriptions is often easier to apply than intensive curation. It is therefore critical to understand the extent to which these techniques can be used to organize and analyze biological datasets and enable biological discoveries. Results: A wide variety of approaches from the natural language processing domain perform as well as similarity metrics over curated annotations for predicting shared phenotypes. These approaches also show promise both for helping curators organize and work through large datasets and for enabling researchers to explore relationships among available phenotype descriptions. Here we generate networks of phenotype similarity and share a web application for querying a dataset of associated plant genes using these text mining approaches. Example situations and species for which these techniques are most useful are discussed. Availability: The dataset used in this work is available at https://git.io/JTutQ. The code for the analysis performed here is available at https://git.io/JTutN and https://git.io/JTuqv. The code for the web application discussed here is available at https://git.io/Jtv9J, and the application itself is available at https://quoats.dill-picl.org/.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::f2b0be87384c8c3e57b37fbb73c0f611 https://doi.org/10.1101/2021.02.04.429796