ECCB

Tutorials

Wednesday
September 28, 2005

14:15/18:00 - Room 9 (2nd Floor)
T3: InterPro, exploring a powerful protein diagnostic tool
Dr. Jennifer McDowall

InterPro is an integrated protein resource that provides protein annotation and classification at family and domain levels. InterPro combines the major signature databases, PROSITE, PRINTS, PFAM, PRODOM, SMART, TIGRFAM, PIR Superfamily, GENE3D, SUPERFAMILY, and PANTHER, as well as structural information from PDB, MSD, CATH, SCOP and SWISS-MODEL, into a unified database. This tutorial is designed to allow users to get the most value out of the database, and will focus on the type and organisation of annotated data, the different query methods possible, understanding the different visualisations of the data, as well as exploring the multiple external links and cross-references available.

Synopsis:

The aim of this tutorial is to familiarise users with the wealth of annotated protein data available within the InterPro database, how to extract this information, and how to use InterPro to analyse and annotate protein sequences using the web interface. By the end of the tutorial, participants should be able to navigate InterPro confidently, carry out a variety of search queries, and understand the different data presentations available, the information they yield, and the external links and cross-references provided.

OUTLINE OF TOPICS COVERED
Introduction to InterPro and fundamentals of the member databases
Overview of the InterPro entry
Understanding the graphical views
InterPro information mining
InterPro sequence analysis

INTRODUCTION TO INTERPRO AND FUNDAMENTALS OF THE MEMBER DATABASES
InterPro is a searchable database that provides information on the function, annotation and classification of proteins, with over 80% of the proteins in UniProt being represented in InterPro. InterPro combines the major signature databases into a unified protein database, with each member database using a different method to derive its signatures: Prosite (patterns and profiles), Prints (motifs), Prodom (sequence clustering), Pfam (HMM), Smart (HMM), Tigrfam (HMM), PIR Superfamily (HMM), Gene3D (HMM), Superfamily (HMM), and Panther (HMM). Furthermore, each member database refines its methods through its choice of seed alignments and post-processing techniques, in order to attain specific goals. Prosite patterns, Prints, Tigrfam, PIR Superfamily and Panther tend to concentrate on identifying protein families, grouping them with regard to sequence or functional constraints. Prosite profiles, Prodom, Smart, Pfam, Gene3D and Superfamily are often used to identify domains, grouping them by functional, sequence or structural relatedness. By combining the different methods, InterPro is able to produce a hierarchical classification scheme that provides information on the relationships between different protein families in terms of sequence, structural and functional divergence, as well as providing information on the domain architecture of the proteins within these families. The tutorial will give an overview of the different signature methods used by the member databases, and the contribution each makes to InterPro.
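
As an illustration of the simplest of these signature types, the sketch below translates a Prosite-style pattern into a regular expression and scans a sequence with it. It is a minimal example and not how Prosite or InterPro implement their scans: the function names prosite_to_regex and find_sites are hypothetical, the test sequence is invented, and only the basic pattern syntax is handled (x, [..], {..}, repeat counts, and the < and > anchors outside brackets). The pattern used is the N-glycosylation site motif N-{P}-[ST]-{P}.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a Prosite-style pattern into a Python regular expression.

    Simplified sketch: anchors inside square brackets are not supported.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # N-terminal anchor
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # C-terminal anchor
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # excluded residues
        # [ABC] groups and single residues are already valid regex.
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_sites(pattern: str, sequence: str):
    """Return (1-based start, matched text) for every hit, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # Invented example sequence, for illustration only.
    print(find_sites("N-{P}-[ST]-{P}", "MKNGSAANPTLNVTK"))
```

Profiles and HMMs score every position of an alignment rather than requiring an exact match, which is why they are better suited to detecting divergent domains than a pattern of this kind.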

OVERVIEW OF THE INTERPRO ENTRY
Protein sequences are grouped together into InterPro entries based on the protein signatures from the member databases. The groups are defined as families, domains, repeats or sites. Each InterPro entry has a unique accession number, name, abstract describing features of the proteins associated with the entry, and literature references with links to PubMed. The signatures that define the entry are listed along with links to the relevant entries in the member databases, providing direct access to the documentation for each signature. Relationships with other entries are listed with their links. These relationships can be either parent/child type, which divide families or domains into more closely related sub-groups to produce a hierarchical classification scheme, or contains/found in type, which describe the organisation of domains, repeats and sites within families. Entries are also annotated with respect to GO terms, providing information on the process, function and component of the proteins within an entry. InterPro entries contain a variety of external links, including those to the structural databases PDB, EMSD, CATH and SCOP, and to the databases MEROPS, PANDIT, Blocks, IntEnz, CAZy, IUPHAR, COMe and CluSTr. The taxonomic coverage of an entry is displayed by a descriptive wheel, which permits the user to select all the sequences from specific taxonomic groups. The tutorial will cover all the features of the InterPro entry, and the external links and cross-references provided, showing participants how to extract the information they need.

UNDERSTANDING THE GRAPHICAL VIEWS
InterPro provides a number of different graphical views to display the protein matches making up an entry. The detailed view provides a graphical description for each of the proteins in an entry ordered by either accession number or name, or restricts the display to only those proteins whose structure is known. All the signatures that hit the proteins are displayed, thereby providing a comprehensive view of each protein, with links provided to the member databases and to related InterPro entries. The structural features of proteins as described by PDB, CATH, SCOP and SWISS-MODEL are displayed two-dimensionally in relation to the signatures, as well as three-dimensionally using AstexViewer. InterPro also includes an overview, where the signatures and structural features are condensed into a simplified graphical view, as well as a tabular format of the protein data. The different domain combinations found in the set of proteins within an entry can be viewed using the InterPro Domain Architecture. The tutorial will provide an in-depth look at the different ways to view the signature information, how to mine the information contained within the views, and how to navigate between the different views and their external links. The use of AstexViewer to visualise the structural features of the proteins will also be explored.

INTERPRO INFORMATION MINING
InterPro can be searched in a number of different ways. The simple text search facility allows queries using keywords, UniProt accession numbers, GO terms, or InterPro entry numbers. The simple InterPro SRS search enables more complex queries by combining two fields, one from InterPro and the other from the list of protein matches. InterPro can also be queried through SRS either directly or indirectly as a database linked to other databases, with the possibility of creating different views, as well as recovering FASTA-format sequences. The tutorial will review the different methods of querying and the syntax that can be used, and will provide an opportunity to see the results of different types of searches.

INTERPRO SEQUENCE ANALYSIS
There is also a sequence search facility using the web-based server of InterProScan, which permits the sequence analysis and characterisation of unknown protein sequences. Nucleotide sequences, both DNA and RNA, can also be used to query InterProScan, in which case the query sequence is translated in all six frames. Using InterProScan, InterPro takes each sequence and analyses it against one or more of the member databases using preconfigured cut-off thresholds. Following analysis, the results are combined, and the InterPro entries and sequence signatures are returned to the submitter as a graphical view with links to both InterPro and SRS. The tutorial will familiarise participants with using InterProScan as a tool to annotate and characterise sequences, using an example sequence to search InterPro via InterProScan, and to analyse the results using the InterPro view and the SRS view.
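
The six-frame translation of a nucleotide query can be sketched as follows. This is an illustrative example under stated assumptions, not InterProScan's own code: it assumes the standard genetic code, treats RNA by replacing U with T, and writes any codon containing an ambiguous base as X. The function names and the frame labels (+1 to +3 for the forward strand, -1 to -3 for the reverse complement) are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(sequence: str) -> dict:
    """Translate a DNA or RNA sequence in all six reading frames."""
    dna = sequence.upper().replace("U", "T")
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

Each of the six translated sequences can then be treated as an ordinary protein query; stop codons appear as * in the output.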

14:15/18:00 - Room 9 Bis (1st Floor)
T4: Computational proteomics
Prof. Colinge Jacques

Proteomics has become an important approach for analyzing biological samples, and it makes extensive use of mass spectrometry to identify and characterize proteins. This tutorial will introduce the audience to the central problem of searching mass spectrometry data against a database of proteins. The presentation should stimulate the interest of bioinformatics researchers in other fields and provide a concise yet accessible introduction for life scientists. The last part of the tutorial will briefly cover other important problems in mass spectrometry data analysis, such as peptide de novo sequencing, eukaryotic genome searches, and protein quantification and characterization.
