The proteome databases and knowledge integration (DIG) axis is one of the three main research axes of PIG. Its objective is to support the analysis and interpretation of proteomics data, driven by the need to solve biological problems. Its main goals are data management and storage; data modelling and structuring; integration of experimental data and annotations; and interpretation of integrated data.
Data management and storage
Proteomics databases are a first step towards analysing and rationalising proteomics data. In collaboration with the wet-lab teams of Prof. Denis Hochstrasser and Dr. Jean-Charles Sanchez at the Biomedical Proteomics Research Group of Geneva University, the PIG group created the SWISS-2DPAGE database of annotated protein maps in 1993. This database contains nearly 40 reference maps (images of 1-DE or 2-DE gels) from various species, including human, mouse and Escherichia coli. In addition to maintaining, annotating and developing the database, the group has created the Make2D-DB software to help non-expert users build their own 2-DE gel databases on their own Websites. Make2D-DB provides various keyword search mechanisms and the ability to perform queries through a graphical interface. The SWISS-2DPAGE database and the databases created with Make2D-DB are all interconnected with one another and with the Swiss-Prot knowledgebase.
Data modelling, integration and interpretation
To characterise a protein means to gather all available information about its function, cellular localisation, post-translational modifications (PTMs), interactions with other proteins, the pathways in which it is involved, and other related information. This requires resolving possible ambiguities in the identification, complementing functional definitions and context, and rationalising the co-occurrence of a set of proteins in a given sample. Characterisation is mainly achieved with on-line bioinformatics resources. However, protein-related information in most databases is accumulated rather than reduced to a synthetic view. If the goal of accumulating information is to discover or reveal function and the related biochemical mechanisms, information has to be weighed and ordered. The weight of a piece of information should reflect how consistently it occurs across various contexts. With this in mind, the group works on bioinformatics research projects for proteomics that aim to blend on-line information rather than pile it up. Some ongoing projects in this area are:
the MicroBioModule project, which aims to study modular combinations of domains in bacterial proteins and lays the basis for a similarity measure between proteomes. A graph-based representation of these combinations offers a way to analyse and visualise relationships between families of bacterial proteins.
the Clustering of Protein Sequences (Clips) project, which proposes a clustering approach specifically designed to handle multi-domain proteins. The method iterates between two procedures: a partitioning algorithm inspired by traditional k-means and a motif discovery algorithm. Partitioning is based not on a linear pairwise sequence similarity measure but on the motif content of each sequence. The method can therefore handle any domain architecture, including domain swapping, duplication or fusion, which hinder most available clustering methods.
the Staphylococcus aureus project, which aims to integrate data from multiple sources to study biofilm formation in this bacterium.
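As an illustration of the graph-based representation mentioned for the MicroBioModule project, the sketch below builds a domain co-occurrence graph for a proteome and compares two such graphs. It is not the project's actual method: the function names, the toy protein and domain labels (x1, D1, and so on) and the use of a Jaccard index over edges as the similarity measure are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def domain_graph(proteome):
    """Build a domain co-occurrence graph.

    `proteome` maps a protein identifier to the list of domains it contains.
    Nodes are domains; the weight of an edge (a, b) is the number of proteins
    that contain both domain a and domain b.
    """
    edges = defaultdict(int)
    for domains in proteome.values():
        # Use the set of domains so repeated domains count once per protein,
        # and sort so each pair gets a single canonical orientation.
        for a, b in combinations(sorted(set(domains)), 2):
            edges[(a, b)] += 1
    return dict(edges)


def edge_jaccard(graph_a, graph_b):
    """Jaccard index over the edge sets of two graphs (assumed measure)."""
    edges_a, edges_b = set(graph_a), set(graph_b)
    union = edges_a | edges_b
    if not union:
        return 1.0  # two empty graphs are treated as identical
    return len(edges_a & edges_b) / len(union)


# Toy, invented data: protein identifiers and domain labels are placeholders.
proteome_x = {"x1": ["D1", "D2"], "x2": ["D2", "D3", "D1"], "x3": ["D4"]}
proteome_y = {"y1": ["D2", "D3"], "y2": ["D3", "D5"]}

graph_x = domain_graph(proteome_x)
graph_y = domain_graph(proteome_y)
similarity = edge_jaccard(graph_x, graph_y)
```

With this toy input, the two proteomes share one domain combination (D2 with D3) out of four distinct combinations overall. Edge weights are ignored by the comparison here; a weighted measure would be an equally plausible choice.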
Tbahriti I, Chichester C, Lisacek F, Ruch P. Using argumentation to retrieve articles with similar citations: An inquiry into improving related articles search in the MEDLINE digital library. Int J Med Inform 2006; 75: 488-495. Pubmed:16165395
Hoogland C, Mostaguir K, Sanchez J-C, Hochstrasser DF, Appel RD. SWISS-2DPAGE, ten years later. Proteomics 2004; 4(8): 2352-2356. Pubmed:15274128