Popis: |
The field of proteogenomics operates at the interface between proteomics and genomics, and has emerged during the past decade to exploit the vast quantities of high-throughput sequence data. A range of proteogenomics approaches has been developed that integrate mass spectrometry data with genome sequence data to provide empirical evidence for protein-coding genes. However, current methods may not be optimal, as they do not fully account for the splicing complexity of eukaryotic genes, and there is currently no best-practice method. To address this, we investigate the level of proteomics support for Ensembl gene models in human and a selection of model organisms. We find a disparity between the number of splice variants confirmed by extant data and the number that could theoretically be confirmed using current proteomics technologies. We then investigate EST-based proteogenomics methods, which enabled the discovery of novel peptide sequences in the chicken genome; these peptides represent hitherto unannotated genes, amended gene models, polymorphisms, and genes missing from the genome assembly. We also explore different approaches for searching mass spectrometry data against transcript sequences, and show that searching mass spectra against protein sequences predicted by the EORF and ESTScan2 translation tools gives the best sensitivity. |