PPAP
The UniProtKB/Swiss-Prot Plant Proteome Annotation Program
PPAP in depth
The completion of the genome sequence of the model plant Arabidopsis thaliana in 2000 and the announcement of the complete sequencing of the rice genome in 2002 led us to initiate a project devoted to the annotation of plant-specific protein families [1, 2]. The current emphasis is on Arabidopsis thaliana and Oryza sativa.
The complete Arabidopsis proteome is estimated to include 34'078 proteins [3]. However, automatic gene prediction is prone to errors, and the release of large quantities of full-length cDNA sequences allowed The Institute for Genomic Research (TIGR) [4] to improve over 27% (more than 9'000) of the predicted gene models. The Arabidopsis Information Resource (TAIR) and TIGR have also published lists of 3'159 and 1'188 genes, respectively, with alternatively spliced gene models [5, 6]. In most of these genes, the alternative splicing occurs in the 5' or 3' untranslated regions and does not change the sequence of the protein produced. All the others are annotated accordingly in UniProtKB/Swiss-Prot, with the various isoforms of the protein described.
Automatic annotation of the rice genome predicts more than 55'000 genes, including some 14'000 transposable elements [7]. However, recent work suggests that many putative genes with no homologous Arabidopsis counterpart may be erroneous predictions, or sequences that are never translated into functional proteins in vivo. We are therefore convinced that rice proteins also require manual annotation, and we started a dedicated program in 2005.
In addition, owing to the polyploid nature of plant genomes (potato is tetraploid, wheat is hexaploid, etc.) and to frequent genome duplications, plants are known to contain large gene families, some of which include up to 100 closely related members that can differ by only one nucleotide in the open reading frame.
From 2003 to 2005, we collaborated with Genoplante, a French partnership program in plant genomics. This project aimed to obtain extensive, homogeneous, reliable, documented, and traceable annotations for Arabidopsis nuclear genes and gene products. Expert-curated annotations of paralogous genes, produced family by family, are gathered into a database named GeneFarm [8]. When available, cross-links between UniProtKB/Swiss-Prot and GeneFarm entries are provided.
As additional plant genomes become available, we will broaden our scope and extend our annotation to proteins from other plants.
[1] Schneider M. et al. (2004) The Swiss-Prot protein knowledgebase and ExPASy: providing the plant community with high quality proteomic data and tools. Plant Physiol Biochem. 42:1013-1021. (DOI=10.1016/j.plaphy.2004.10.009)
[2] Schneider M. et al. (2005) Plant protein annotation in the UniProt Knowledgebase. Plant Physiol. 138:59-66. (DOI=10.1104/pp.104.058933)
[3] Proteome Analysis @ EBI (Version 11/07/2006)
[4] Haas B. et al. (2002) Full-length messenger RNA sequences greatly improve genome annotation. Genome Biol. 3(6):RESEARCH0029.
[5] ftp://ftp.arabidopsis.org/home/tair/User_Requests/tair6_splice_variants/
[6] http://www.tigr.org/tdb/e2k1/ath1/altsplicing/splicing_variations.shtml
[7] http://www.tigr.org/tdb/e2k1/osa1/index.shtml
[8] Aubourg S. et al. (2005) GeneFarm, structural and functional annotation of Arabidopsis gene and protein families by a network of experts. Nucleic Acids Res. 33:D641-D646. (DOI=10.1093/nar/gki115)