Biology Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models

Authors:	He, Haonan; Ren, Yuchen; Tang, Yining; Xu, Ziyang; Li, Junxian; Yang, Minghao; Zhang, Di; Yuan, Dong; Chen, Tao; Zhang, Shufei; Li, Yuqiang; Dong, Nanqing; Ouyang, Wanli; Zhou, Dongzhan; Ye, Peng
Year of publication:	2024
Subject:	Quantitative Biology - Biomolecules; Computer Science - Artificial Intelligence; Computer Science - Machine Learning
Document type:	Working Paper
Description:	Large language models have already demonstrated their formidable capabilities in general domains, ushering in a revolutionary transformation. However, how to exploit the extensive knowledge of these models to comprehend multi-omics biology remains underexplored. To fill this research gap, we first introduce Biology-Instructions, the first large-scale instruction-tuning dataset for multi-omics biological sequences, covering DNA, RNA, proteins, and multi-molecules, designed to bridge the gap between large language models (LLMs) and complex biological sequence-related tasks. This dataset can enhance the versatility of LLMs by integrating diverse biological sequence-based prediction tasks with advanced reasoning capabilities, while maintaining conversational fluency. Additionally, we reveal significant performance limitations in even state-of-the-art LLMs on biological sequence-related multi-omics tasks without specialized pre-training and instruction-tuning. We further develop a strong baseline called ChatMultiOmics with a novel three-stage training pipeline, demonstrating a powerful ability to understand biology when trained on Biology-Instructions. Biology-Instructions and ChatMultiOmics are publicly available and are crucial resources for enabling more effective integration of LLMs with multi-omics sequence analysis.
Database:	arXiv
External link:	http://arxiv.org/abs/2412.19191