Scaling Down for Efficiency: Medium-Sized Transformer Models for Protein Sequence Transfer Learning.

Authors: Vieira LC; Department of Integrative Biology, The University of Texas at Austin, Austin, TX, United States of America. Handojo ML; Department of Integrative Biology, The University of Texas at Austin, Austin, TX, United States of America. Wilke CO; Department of Integrative Biology, The University of Texas at Austin, Austin, TX, United States of America.
Language: English
Source: bioRxiv: the preprint server for biology [bioRxiv] 2024 Nov 24. Date of Electronic Publication: 2024 Nov 24.
DOI: 10.1101/2024.11.22.624936
Abstract: Protein language models such as the transformer-based Evolutionary Scale Modeling 2 (ESM2) can offer deep insights into evolutionary and structural properties of proteins. While larger models, such as ESM2 15B, promise to capture more complex patterns in sequence space, they also present practical challenges due to their high dimensionality and high computational cost. We systematically evaluated the performance of all ESM2 models across many biological datasets to determine the impact of model size on transfer learning. Surprisingly, larger models do not always outperform smaller ones, especially when data is limited. Medium-sized models, such as ESM2 650M, exhibited consistent performance, falling only slightly behind the 15B-parameter model despite being over 20 times smaller. Additionally, we compared various methods of embedding compression to identify the most effective approach, and we found that mean embeddings consistently outperformed other compression methods. Our results show that ESM2 650M with mean embeddings offers an optimal balance between performance and efficiency, making it a practical and scalable choice for transfer learning in a variety of biological applications.
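A minimal sketch of the mean-embedding approach described in the abstract, assuming the publicly available fair-esm package and the ESM2 650M checkpoint (esm2_t33_650M_UR50D); this is an illustration of mean-pooling per-residue embeddings into fixed-size features, not the authors' exact pipeline, and the example sequences are placeholders.

```python
# Sketch: compute mean-pooled ESM2 650M embeddings for transfer learning (assumed setup).
import torch
import esm

# Load the 650M-parameter ESM2 model; its final representation layer is 33.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

# Placeholder sequences; in practice these come from the biological dataset of interest.
data = [("seq1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"),
        ("seq2", "GWTLNSAGYLLGPHAVGNHRSFSDKNGLTS")]
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33])
per_residue = out["representations"][33]  # shape: (batch, tokens, 1280)

# Mean embedding: average over real residues only, skipping BOS/EOS and padding tokens.
mean_embeddings = torch.stack([
    per_residue[i, 1:len(seq) + 1].mean(dim=0)
    for i, (_, seq) in enumerate(data)
])  # shape: (batch, 1280) fixed-size features
```

The resulting fixed-size vectors can then be fed to a lightweight downstream model (e.g., a regression or classification head) for the transfer-learning tasks discussed above.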
Competing Interests: The authors declare no competing interests.
Database: MEDLINE