Abstract: |
Motivation: Obtaining accurate estimates of the number of protein-coding genes and protein domains in a proteome, and of the number of protein domains in nature, is a daunting challenge. Computational analysis of the protein domain sets in the proteomes of many species allows us to estimate these numbers and to find their evolutionary relationships. Results: We have analyzed the distributions of the number of occurrences of protein domains in sample proteomes of 70 organisms with fully sequenced genomes from the three major kingdoms of life: Archaea, Bacteria and Eukaryota. We found that a large fraction of the identified distinct protein domains (i.e., unique domains and homologous domain families) in these 70 proteomes (1051 of 4493, or 23%) are found in at least one organism in each of these kingdoms, and that 43 (1%) of the 4493 domains are common to all 70 organisms. All the observed domain occurrence frequency distributions for these 70 proteomes are well fitted by a family of Pareto-like functions, associated with the steady-state distributions of a linear Markov random process. We present explicit formulas that accurately predict the number of distinct protein domains and the number of protein-coding genes for a given organism as functions of the number of non-redundant domain-to-protein links in its proteome. These functions allow us to predict that there are 42,740, 27,900, and 21,200 protein-coding genes/open reading frames in the human, A. thaliana, and mouse genomes, respectively. We also estimate that there are 5271, 2955, and 4915 distinct protein domains in the human, A. thaliana, and mouse proteomes, respectively, and about 5500 distinct protein domains in the entire "proteome world". [ABSTRACT FROM AUTHOR] |