Proteogenomics: Decoding the genome using proteomics
The genome of every living organism contains the blueprints for creating proteins. Proteins are the workhorses of the cell and perform many diverse and vital functions, including transport of nutrients, maintenance of cellular structure, cell- and organism-level immunity and defense, and metabolism. However, identifying all protein-coding elements in a genome, a fundamental first step towards understanding the basic mechanisms of life, remains an open problem even for the best-studied species. Current gene annotation pipelines derive gene predictions from many sources, ranging from purely computational gene predictors to biological assays of gene activity. Recently, a new approach to this problem, termed proteogenomics, has gained popularity. It queries the proteome directly using tandem mass spectrometry (MS) and works backwards to the genome to determine gene locations. Used as a complement to current gene annotation techniques, proteogenomics can achieve more complete and accurate genome annotation.
We developed an automated pipeline for interpreting MS data for the purpose of gene annotation in Zea mays, also known as corn. Corn poses particular challenges because of the size of its genome (approximately 3 billion nucleotides) and the deep sampling of its proteome by MS (nearly 500 million spectra). We developed software for fast and accurate annotation, and a statistical framework for analyzing the quality of discoveries. To date, we have discovered nearly 100 novel genes with 99% confidence. Our pipeline is deployed on a computer cluster containing 300 CPUs and can be generalized to any organism.
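To make the idea of "working backwards to the genome" concrete, the following is a minimal sketch of one common way to place a peptide on a genomic sequence: translate the DNA in all six reading frames and search each translation for the peptide. It is an illustration only, not the pipeline described above. The function names (`reverse_complement`, `translate`, `locate_peptide`) and the toy sequence are hypothetical, the standard genetic code is assumed, and introns are ignored, so a peptide spanning an exon junction would not be found by this approach.

```python
# Illustrative sketch only: six-frame translation and exact peptide lookup.
# Assumes the standard genetic code and an unspliced (intron-free) region.

BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Map each codon to its amino acid ('*' marks a stop codon).
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def translate(seq):
    """Translate DNA codon by codon; unrecognized codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def locate_peptide(genome, peptide):
    """Find a peptide in all six reading frames of `genome`.

    Returns a list of (strand, frame, start, end) tuples, where start and
    end are 0-based, half-open coordinates on the forward strand.
    """
    n = len(genome)
    hits = []
    for strand, dna in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            protein = translate(dna[frame:])
            pos = protein.find(peptide)
            while pos != -1:
                s = frame + 3 * pos          # offset on the scanned strand
                e = s + 3 * len(peptide)
                if strand == "+":
                    hits.append((strand, frame, s, e))
                else:
                    # Convert reverse-strand offsets to forward coordinates.
                    hits.append((strand, frame, n - e, n - s))
                pos = protein.find(peptide, pos + 1)
    return hits


if __name__ == "__main__":
    toy_genome = "CCATGGCTAAATGGTT"  # made-up sequence for illustration
    print(locate_peptide(toy_genome, "MAKW"))
```

In practice the peptides would first have to be identified from spectra, and at the scale described here (billions of nucleotides, hundreds of millions of spectra) an indexed search would be needed rather than a linear scan.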