Node-degree aware edge sampling mitigates inflated classification performance in biomedical graph representation learning

Authors:	Luca Cappelletti, Lauren Rekerle, Tommaso Fontana, Peter Hansen, Elena Casiraghi, Vida Ravanmehr, Christopher J Mungall, Jeremy Yang, Leonard Spranger, Guy Karlebach, J. Harry Caufield, Leigh Carmody, Ben Coleman, Tudor Oprea, Justin Reese, Giorgio Valentini, Peter N Robinson
Year of publication:	2022
DOI:	10.1101/2022.11.21.517376
Description:	Graph representation learning is a family of related approaches that learn low-dimensional vector representations, called embeddings, of nodes and other graph elements. Embeddings approximate characteristics of the graph and can be used for a variety of machine-learning tasks such as novel edge prediction. For many biomedical applications, partial knowledge exists about positive edges that represent relationships between pairs of entities, but little to no knowledge is available about negative edges that represent the explicit lack of a relationship between two nodes. For this reason, classification procedures are forced to assume that the vast majority of unlabeled edges are negative. Existing approaches to sampling negative edges for training and evaluating classifiers do so by uniformly sampling pairs of nodes. We show here that this sampling strategy typically leads to sets of positive and negative edges with imbalanced edge degree distributions. Using representative homogeneous and heterogeneous biomedical knowledge graphs, we show that this strategy artificially inflates measured classification performance. We present a degree-aware node sampling approach for sampling negative edge examples that mitigates this effect and is simple to implement.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::b91f6c80ffe2a5099baac6ccb7091289 https://doi.org/10.1101/2022.11.21.517376
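
The description contrasts uniform sampling of node pairs with a degree-aware approach for drawing negative edges. The record gives no implementation details, so the following is only a minimal sketch of one plausible reading: each endpoint of a candidate negative edge is drawn with probability proportional to its degree in the positive graph, and candidates that are self-loops, known positive edges, or duplicates are rejected. The function name `sample_negative_edges`, its parameters, and the rejection loop are assumptions made for illustration and are not taken from the paper.

```python
import random
from collections import Counter


def sample_negative_edges(edges, num_samples, degree_aware=True, seed=0):
    """Sample undirected node pairs that are not in `edges`.

    Hypothetical helper, not the authors' code. With degree_aware=True,
    endpoints are drawn proportionally to node degree; otherwise uniformly.
    """
    rng = random.Random(seed)
    existing = {frozenset(edge) for edge in edges}

    # Degree of each node in the graph of known (positive) edges.
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    nodes = sorted(degree)
    weights = [degree[n] for n in nodes] if degree_aware else None

    negatives, seen = [], set()
    attempts, max_attempts = 0, 100 * num_samples  # guard against dense graphs
    while len(negatives) < num_samples and attempts < max_attempts:
        attempts += 1
        u, v = rng.choices(nodes, weights=weights, k=2)
        pair = frozenset((u, v))
        # Reject self-loops, known positive edges, and duplicates.
        if u == v or pair in existing or pair in seen:
            continue
        seen.add(pair)
        negatives.append((u, v))
    return negatives


# Toy star graph: node 0 is a hub connected to leaves 1..20.
toy_edges = [(0, leaf) for leaf in range(1, 21)]
toy_negatives = sample_negative_edges(toy_edges, num_samples=10)
```

Setting `degree_aware=False` in the same call gives the uniform baseline, so the degree distributions of the two negative sets can be compared against that of the positive edges.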