Toward a Catalog of Human Genes and Proteins: Sequencing and Analysis of 500 Novel Complete Protein Coding Human cDNAs

Authors: Stefan Wiemann, Brigitte Obermaier, Stefan Bauersachs, Bernd Weil, Bernhard Korn, Michael Böcher, Michaela Klein, Wilhelm Ansorge, Dagmar Heubner, Hans-Werner Mewes, Sabine Glassl, Jürgen Lauber, Birgit Ottenwälder, Karl Köhrer, R. Wambutt, Andreas Beyer, Ruth Wellenreuther, Helmut Blum, A. Düsterhöft, Jens Tampe, Johannes Gassenhuber, Helmut Blöcker, Normann Strack, Annemarie Poustka
Contributors: Publica
Publication year: 2001
Subject:
Source: Genome Research. 11:422-435
ISSN: 1549-5469
1088-9051
Description: With the complete human genomic sequence being unraveled, the focus will shift to gene identification and to the functional analysis of gene products. The generation of a set of cDNAs, both sequences and physical clones, which contains the complete and noninterrupted protein coding regions of all human genes will provide the indispensable tools for the systematic and comprehensive analysis of protein function to eventually understand the molecular basis of man. Here we report the sequencing and analysis of 500 novel human cDNAs containing the complete protein coding frame. Assignment to functional categories was possible for 52% (259) of the encoded proteins, the remaining fraction having no similarities with known proteins. By aligning the cDNA sequences with the sequences of the finished chromosomes 21 and 22, we identified a number of genes that either had been completely missed in the analysis of the genomic sequences or had been wrongly predicted. Three of these genes appear to be present in several copies. We conclude that full-length cDNA sequencing continues to be crucial for the accurate identification of genes as well. The set of 500 novel cDNAs, and another 1000 full-coding cDNAs of known transcripts we have identified, adds up to cDNA representations covering 2%–5% of all human genes. We thus substantially contribute to the generation of a gene catalog, consisting of both full-coding cDNA sequences and clones, which should be made freely available and will become an invaluable tool for detailed functional studies. [The sequence data described in this paper have been submitted to the EMBL database under the accession nos. given in Table 2.]

[Table 2: Functional Classification of Individual cDNAs]

The recent past has witnessed major advances in the determination of the sequence of the human genome (Dunham et al. 1999; Hattori et al. 2000). Although the whole genomic sequence will be completely unraveled in the near future (Collins et al. 1998), the identification of genes and the deciphering of gene structures will extend for a prolonged time, and cDNA sequences will continue to be invaluable tools for this adventure, especially in view of alternative splicing. The primary focus will shift to the functional analysis of the genes and their protein products to finally understand the molecular basis of human life. Current estimates of the number of genes constituting the protein coding repertoire of the human genome vary between 29,000 and >70,000 (Fields et al. 1994; Ewing and Green 2000; Liang et al. 2000; Roest Crollius et al. 2000). However, thus far only some 11,000 cDNA sequences that are supposed to contain the complete protein coding open reading frame (ORF) have been deposited in public databases. The majority of the respective cDNA clones are most likely not accessible. The generation of a freely accessible physical clone set representing all human genes is consequently regarded as having an extremely high impact (Schuler 1997; Pruitt et al. 2000). This would permit the establishment of a catalog of clones to provide the resources needed in the proteomics era, where the functions of proteins, their action in pathways, and their possible relation to disease are deciphered.

Until recently, the long-cDNA sequencing project carried out at the Kazusa Institute (Nomura et al. 1994; Nagase et al. 2000) had been the only systematic full-length cDNA sequencing project with a significant output of novel sequence information. A new large-scale cDNA sequencing project, coordinated by the National Institutes of Health, has recently been announced (Strausberg et al. 1999). We founded a cDNA Consortium in 1997 as part of the German Genome Project, aiming at the characterization of the complete sequences of novel human transcripts at the cDNA level. Here, we report the sequences and analysis of 500 novel human cDNAs that all contain the complete protein coding region. These cDNAs represent the most valuable subset of 30,000 clones that have been EST sequenced and 3630 cDNAs that have been fully sequenced. Over 1000 cDNAs that cover the complete coding sequence of already known transcripts have been identified in the EST-sequenced clone set. All clones are made available through the Resource Center of the German Genome Project (RZPD).
Database: OpenAIRE