EmbedGEM: a framework to evaluate the utility of embeddings for genetic discovery.

Authors: Mukherjee S; Insitro Inc, South San Francisco, California 94080, United States., McCaw ZR; Insitro Inc, South San Francisco, California 94080, United States., Pei J; Insitro Inc, South San Francisco, California 94080, United States., Merkoulovitch A; Insitro Inc, South San Francisco, California 94080, United States., Soare T; Insitro Inc, South San Francisco, California 94080, United States., Tandon R; Center for Machine Learning, Georgia Institute of Technology, Georgia 30332, United States., Amar D; Insitro Inc, South San Francisco, California 94080, United States., Somineni H; Insitro Inc, South San Francisco, California 94080, United States., Klein C; Insitro Inc, South San Francisco, California 94080, United States., Satapati S; Insitro Inc, South San Francisco, California 94080, United States., Lloyd D; Insitro Inc, South San Francisco, California 94080, United States., Probert C; Insitro Inc, South San Francisco, California 94080, United States., Koller D; Insitro Inc, South San Francisco, California 94080, United States., O'Dushlaine C; Insitro Inc, South San Francisco, California 94080, United States., Karaletsos T; Chan-Zuckerberg Initiative, Redwood City, California 94063, United States.
Language: English
Source: Bioinformatics advances [Bioinform Adv] 2024 Sep 17; Vol. 4 (1), pp. vbae135. Date of Electronic Publication: 2024 Sep 17 (Print Publication: 2024).
DOI: 10.1093/bioadv/vbae135
Abstract: Summary: Machine learning-derived embeddings are a compressed representation of high-content data modalities. Embeddings can capture detailed information about disease states and have been qualitatively shown to be useful in genetic discovery. Despite their promise, embeddings have a major limitation: it is unclear whether genetic variants associated with embeddings are relevant to the disease or trait of interest. In this work, we describe EmbedGEM (Embedding Genetic Evaluation Methods), a framework to systematically evaluate the utility of embeddings in genetic discovery. EmbedGEM focuses on comparing embeddings along two axes: heritability and disease relevance. As measures of heritability, we consider the number of genome-wide significant associations and the mean χ² statistic at significant loci. For disease relevance, we compute polygenic risk scores for each embedding principal component, then evaluate their association with high-confidence disease or trait labels in a held-out evaluation patient set. While our development of EmbedGEM is motivated by embeddings, the approach is generally applicable to multivariate traits and can readily be extended to accommodate additional metrics along the evaluation axes. We demonstrate EmbedGEM's utility by evaluating embeddings and multivariate traits in two separate datasets: (i) a synthetic dataset simulated to demonstrate the ability of the framework to correctly rank traits based on their heritability and disease relevance and (ii) real data from the UK Biobank, including metabolic and liver-related traits. Importantly, we show that greater disease relevance does not automatically follow from greater heritability.
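The two evaluation axes named in the abstract can be illustrated with a minimal sketch. This is not the EmbedGEM implementation or its API: the function names, the toy numbers, the 5e-8 significance threshold, the assumption of one 1-degree-of-freedom χ² statistic per independent locus, and the use of a Pearson correlation as the association measure are all assumptions made for illustration.

```python
import math

# Conventional genome-wide significance threshold (assumed; not stated in the abstract).
GWS_P = 5e-8


def chi2_sf_1df(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def heritability_metrics(locus_chi2, p_threshold=GWS_P):
    """Return (number of significant loci, mean chi-square at those loci).

    Assumes `locus_chi2` already holds one statistic per independent locus.
    """
    significant = [s for s in locus_chi2 if chi2_sf_1df(s) < p_threshold]
    n = len(significant)
    mean_stat = sum(significant) / n if n else float("nan")
    return n, mean_stat


def polygenic_score(dosages, weights):
    """Weighted sum of allele dosages for one individual."""
    return sum(d * w for d, w in zip(dosages, weights))


def pearson_r(x, y):
    """Pearson correlation, used here as a simple stand-in association measure."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Heritability axis: hypothetical per-locus chi-square statistics for one embedding PC.
toy_chi2 = [35.2, 4.1, 50.0, 29.0, 1.3]
n_hits, mean_chi2 = heritability_metrics(toy_chi2)

# Disease-relevance axis: hypothetical held-out genotypes (allele dosages),
# per-variant weights for the same PC, and binary disease labels.
toy_genotypes = [[0, 1, 0], [1, 2, 0], [2, 0, 1], [1, 0, 2]]
toy_weights = [0.2, -0.1, 0.3]
toy_labels = [0, 0, 1, 1]
toy_scores = [polygenic_score(g, toy_weights) for g in toy_genotypes]
relevance = pearson_r(toy_scores, toy_labels)

print(n_hits, mean_chi2, relevance)
```

Keeping the two axes as separate functions mirrors the abstract's point that a trait can score well on heritability without scoring well on disease relevance, so each metric is reported on its own rather than combined.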
Availability and Implementation: https://github.com/insitro/EmbedGEM.
Competing Interests: None declared.
(© The Author(s) 2024. Published by Oxford University Press.)
Database: MEDLINE