SWISS-PROT RELEASE 17.0 RELEASE NOTES
1. INTRODUCTION
1.1 Evolution
Release 17.0 of SWISS-PROT contains 20024 sequence entries, comprising
6'524'504 amino acids abstracted from 19591 references. This represents
an increase of 9% over release 16. The recent growth of the data bank is
summarized below:
Release Date Number of entries Nb of amino acids
3.0 11/86 4160 969 641
4.0 04/87 4387 1 036 010
5.0 09/87 5205 1 327 683
6.0 01/88 6102 1 653 982
7.0 04/88 6821 1 885 771
8.0 08/88 7724 2 224 465
9.0 11/88 8702 2 498 140
10.0 03/89 10008 2 952 613
11.0 07/89 10856 3 265 966
12.0 10/89 12305 3 797 482
13.0 01/90 13837 4 347 336
14.0 04/90 15409 4 914 264
15.0 08/90 16941 5 486 399
16.0 11/90 18364 5 986 949
17.0 02/91 20024 6 524 504
1.2 Source of data
Release 17.0 has been updated using protein sequence data from release
26.0 of the PIR (Protein Identification Resource) protein data bank, as
well as translation of nucleotide sequence data from release 25.0 of the
EMBL Nucleotide Sequence Data Library.
As an indication to the source of the sequence data in the SWISS-PROT
data bank we list here the statistics concerning the DR (Databank
Reference) pointer lines:
Entries with pointer(s) to only PIR entri(es): 3752
Entries with pointer(s) to only EMBL entri(es): 3713
Entries with pointer(s) to both EMBL and PIR entri(es): 12112
Entries with no pointers lines: 447
2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 16
2.1 Sequences and annotations
About 1700 sequences have been added since release 16, the sequence data
of 312 existing entries has been updated and the annotations of 3750
entries have been revised. In particular we have used reviews articles
to update the annotations of the following groups or families of
proteins:
- 6-phosphogluconate dehydrogenase
- Aconitase
- Alpha-2 macroglobulin family
- ATP synthase a subunit
- Catalases
- Chalcone resveratrol synthases
- Citrate synthase
- Dihydroorotase
- DNA polymerase family A
- Eukaryotic cobalamin-binding proteins
- Fatty acid desaturases
- Fungal Zn(2)-Cys(6) binuclear cluster domain transcriptional
activators
- Gamma-glutamyltranspeptidase
- Glutamine amidotransferases class-I
- Glutamine amidotransferases class-II
- Gonadotropin-releasing hormones
- Guanylate cyclases
- LIM-1 domain proteins
- Myotoxins
- Nucleoside diphosphate kinases
- Pathogenesis-related proteins BetvI family
- Peroxidases
- Polyprenyl synthetases
- Ribosomal proteins
- Rotamases (cyclophilin and FKBP)
- Small cytokines (PF4/IL-8 and MCAF/MIP-1 subfamilies)
- Sodium symporters
- Thiol-activated cytolysins
2.2 Status of cross-references to PIR
Older releases of SWISS-PROT contained cross-references to entries which
were present only in the annotated section of PIR (currently known as
PIR1); we started adding cross-references to entries in the unannotated
sections of PIR (known as PIR2 and PIR3).
3. FORTHCOMING CHANGES
3.1 New line-types: RC and RP
We plan to implement the following change in release 19; the current RN
line will be replaced by three line types: a modified RN (Reference
Number) line type containing just the reference number, a new RC
(Reference Comment) line type containing comments relevant to the
reference (strain, tissue, etc.), and a new RP (Reference Position) line
type containing the extent of the sequencing carried out by the authors
of the reference. Three examples of the usage of these new lines are
given below.
RN [1]
RC STRAIN=K12;
RP SEQUENCE FROM N.A., AND SEQUENCE OF 1-23.
RN [1]
RC STRAIN=BALB/C; TISSUE=BRAIN;
RP SEQUENCE OF 24-56 AND 67-89.
RN [2]
RC X-RAY CRYSTALLOGRAPHY=1.8 ANGSTROMS;
Each reference block will continue to have exactly one RN line. As many
RC lines as are needed to display the reference's comment will appear.
If a reference has no comment then the RC line will not appear. As many
RP lines as are needed to display the extent of sequencing carried out
by the authors of the reference. If a reference does not pertain
directly to sequencing data then the RP line will not appear.
3.2 New line-types: CA and CF
As we announced in the last two release notes, starting with release 18,
the enzyme entries in SWISS-PROT will have two new line-types:
CA Description_of_catalytic_activity.
CF Description_of_cofactor.
These lines will be automatically generated at each release of SWISS-
PROT from the information stored in the ENZYME data bank. They will
replace the 'CATALYTIC ACTIVITY` and 'COFACTORS` comment lines (CC)
topics. Example:
CC -!- CATALYTIC ACTIVITY: L-ASPARTATE + 2-OXOGLUTARATE =
OXALOACETATE + L-GLUTAMATE.
CC -!- COFACTOR: PYRIDOXAL PHOSPHATE.
will be changed to:
CA L-ASPARTATE + 2-OXOGLUTARATE = OXALOACETATE + L-GLUTAMATE.
CF PYRIDOXAL PHOSPHATE.
3.3 Change in the OS line
As we announced in the last two release notes, starting with release 18,
we will invert the order of the information in the OS line. Currently we
have 'English common name (Latin name)`, we will switch to 'Latin name
(English common name)`. Example:
OS HUMAN (HOMO SAPIENS).
will be changed to:
OS HOMO SAPIENS (HUMAN).
4. ENZYME AND PROSITE
4.1 The ENZYME data bank
Release 4.0 of the ENZYME data bank is distributed along with release 17
of SWISS-PROT. ENZYME release 4.0 contains information relative to 3072
enzymes. The data bank is complete and up to date. Until new enzyme
nomenclature data is published we only plan to update the SWISS-PROT
pointers at each release of the protein sequence data bank, correct
eventual errors, and complete the information concerning synonyms and
cofactors using the literature.
4.2 The PROSITE data bank
Release 6.1 of the PROSITE data bank is distributed along with release
17 of SWISS-PROT. PROSITE release 6.1 does not really represent a new
release; the only changes between release 6.0 and 6.1 are updating of
the pointers to the SWISS-PROT entries whose name have been modified
between release 16 and 17. The next release of PROSITE (7.0) will be
distributed with release 18.0 of SWISS-PROT.
5. WE NEED YOUR HELP !
We welcome feedback from our users. We would especially appreciate that
you notify us if you find that sequences belonging to your field of
expertise are missing from the data bank. We also would like to be
notified about annotations to be updated, as for example if the function
of a protein has been clarified or if new post-translational information
has become available.
APPENDIX A: SOME STATISTICS
A.1 Amino acid composition
A.1.1 Composition in percent for the complete data bank
Ala (A) 7.64 Gln (Q) 4.09 Leu (L) 9.09 Ser (S) 7.10
Arg (R) 5.24 Glu (E) 6.28 Lys (K) 5.86 Thr (T) 5.86
Asn (N) 4.44 Gly (G) 7.12 Met (M) 2.31 Trp (W) 1.30
Asp (D) 5.24 His (H) 2.27 Phe (F) 3.95 Tyr (Y) 3.21
Cys (C) 1.83 Ile (I) 5.44 Pro (P) 5.10 Val (V) 6.48
Asx (B) 0.01 Glx (Z) 0.01 Xaa (X) 0.03
A.1.2 Classification of the amino acids by their frequency
Leu, Ala, Gly, Ser, Val, Glu, Thr, Lys, Ile, Asp, Arg, Pro, Asn, Gln,
Phe, Tyr, Met, His, Cys, Trp
A.2 Repartition of the sequences by their organism of origin
Total number of species represented in this release of SWISS-PROT: 2630
A.2.1 Table of the frequency of occurrence of species
Species represented 1x: 1138
2x: 476
3x: 257
4x: 166
5x: 116
6x: 87
7x: 69
8x: 43
9x: 61
10x: 16
11- 20x: 98
21-100x: 81
>100x: 22
A.2.2 Table of the most represented species
Number Frequency Species
1 1659 Human
2 1376 Escherichia coli
3 940 Mouse
4 871 Rat
5 643 Baker's yeast (Saccharomyces cerevisiae)
6 453 Bovine
7 376 Fruit fly (Drosophila melanogaster)
8 331 Chicken
9 252 Bacillus subtilis
10 246 Rabbit
11 236 African clawed frog (Xenopus laevis)
12 232 Vaccinia virus (strain Copenhagen)
13 220 Pig
14 190 Human cytomegalovirus (strain AD169)
15 176 Salmonella typhimurium
16 160 Bacteriophage T4
17 142 Maize
18 124 Rice
19 111 Tobacco
20 108 Vaccinia virus (strain WR)
21 104 Pea
22 102 Wheat
23 96 Staphylococcus aureus
24 91 Slime mold (Dictyostelium discoideum)
25 86 Barley
26 85 Sheep
27 84 Liverwort (Marchantia polymorpha)
28 83 Soybean
29 82 Spinach
30 73 Caenorhabditis elegans
73 Neurospora crassa
A.3 Repartition of the sequences by size
From To Number From To Number
1- 50 1174 1001-1100 162
51- 100 2099 1101-1200 105
101- 150 3021 1201-1300 88
151- 200 1820 1301-1400 52
201- 250 1468 1401-1500 45
251- 300 1316 1501-1600 20
301- 350 1147 1601-1700 22
351- 400 1131 1701-1800 20
401- 450 867 1801-1900 22
451- 500 959 1901-2000 21
501- 550 718 2001-2100 9
551- 600 475 2101-2200 22
601- 650 336 2201-2300 24
651- 700 258 2301-2400 11
701- 750 243 2401-2500 11
751- 800 183 >2500 37
801- 850 144
851- 900 150
901- 950 95
951-1000 89
Currently the ten largest sequences are:
RYNR$RABIT 5037 a.a.
APB$HUMAN 4563 a.a.
APOA$HUMAN 4548 a.a.
POLG$BVDV 3988 a.a.
POLG$HCVA 3898 a.a.
TRX$DROME 3759 a.a.
ACVA$PENCH 3746 a.a.
DMD$HUMAN 3685 a.a.
DMD$CHICK 3660 a.a.
POLG$KUNJM 3433 a.a.