ECCB

Tutorials

Wednesday
September 28, 2005

14:15/18:00 - Room 9 Bis (1st Floor)
T4: Computational proteomics
Prof. Colinge Jacques

Proteomics has become an important approach for analyzing biological samples, and it makes extensive use of mass spectrometry to identify and characterize proteins. This tutorial will introduce the audience to the central problem of searching mass spectrometry data against a database of proteins. The presentation should stimulate the interest of bioinformatics researchers from other fields and provide a concise yet accessible introduction for life scientists. The last part of the tutorial will briefly cover other important problems in mass spectrometry data analysis, such as peptide de novo sequencing, eukaryote genome search, and protein quantification and characterization.

Synopsis:

Part 1 (20 min): introduce the main problems and technologies in proteomics.
Part 2 (40 min): Peptide mass fingerprinting is presented as a first, simple type of data to analyze and also as a means to introduce the basic concepts of database searching with mass spectrometry (MS) data.
Part 3 (20 min): Raw spectrum processing is briefly covered to link the somewhat abstract mass lists used for searching databases with the signal generated by the MS instruments.
Part 4 (60 min): Tandem mass spectrometry (MS/MS) technologies are introduced and MS/MS database searching is discussed in detail.
Part 5 (40 min): Although the main focus of the tutorial is on database searching, other current and important topics in MS-related bioinformatics are presented, such as peptide de novo sequencing, protein quantification and differential proteomics. The material introduced for database searching allows the audience to appreciate the importance of these topics and the challenges they pose for bioinformatics.

Part 1: Introduction to proteomics

We start by introducing the main problems in proteomics: identify proteins in a sample, characterize modified proteins, compare samples and quantify proteins.

We point out the difficulty posed by highly complex samples with a high dynamic range of protein concentrations. Then we introduce the main technologies for reducing initial sample complexity: liquid chromatography and gels.

We then rapidly introduce the concept of mass spectrometry as a technology to measure masses that constitute specific data sets, which may be used for obtaining information about the proteins.

Top-down versus bottom-up proteomics: explain the two approaches and their respective advantages for specific problems. In particular, protein characterization (top-down) and sample analysis (bottom-up) are used as typical problems to illustrate the two opposite approaches.

For the rest of the tutorial, bottom-up proteomics with enzymatic digestion is the standard case we consider. Explain the classical strategies for analyzing samples: protein separation and late digestion (gel or LC) versus shotgun peptide sequencing and early digestion.

Put the material of this first part in perspective from the point of view of data processing. Highlight the central role of the protein/peptide identification problem.

Part 2: Peptide mass fingerprinting and MALDI instruments

On the basis of the general context presented in Part 1, we introduce and detail a first proteomics method: peptide mass fingerprinting (PMF).

Present the combination 2D gels + MALDI-TOF mass spectrometry. Explain that the masses of enzymatic, typically tryptic, peptides constitute a data set specific to the protein.

Show a first example with a spectrum and a database search result. Mention the problem of peak detection in the spectrum, which will be detailed later in Part 3.

Explain a basic algorithm for searching PMF data against a database of protein sequences. Introduce the notion of a scoring function, which aims to measure the correlation between theoretical and experimental spectra, the theoretical spectrum being the list of peptide masses computed from the protein sequences. First scoring function: “shared peak count”, i.e. the number of masses in common between theoretical and experimental spectra, given a mass error tolerance.
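
The basic search described above can be sketched in a few lines of Python. This is only a minimal illustration, not the code of any package discussed in the tutorial. The function names (tryptic_peptides, peptide_mh, shared_peak_count, search) and the dictionary used as a database are assumptions made for the example, and the digestion rule (cleave after K or R unless followed by P, no missed cleavages) is the usual simplification for trypsin.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def tryptic_peptides(sequence):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        nxt = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mh(peptide):
    """Singly protonated monoisotopic mass [M+H]+ of a peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


def shared_peak_count(experimental, theoretical, tol):
    """Number of theoretical masses matched by at least one experimental mass."""
    return sum(
        any(abs(e - t) <= tol for e in experimental) for t in theoretical
    )


def search(experimental, database, tol):
    """Score every protein and return (score, name) pairs, best first."""
    scored = []
    for name, sequence in database.items():
        theoretical = [peptide_mh(p) for p in tryptic_peptides(sequence)]
        scored.append((shared_peak_count(experimental, theoretical, tol), name))
    return sorted(scored, reverse=True)
```

A typical call passes the detected peak masses, a mapping from protein name to sequence, and a tolerance in daltons, for example search(peaks, database, 0.05). The first entry of the returned list is the best-scoring protein.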

Show examples of database searches with shared peak count. Comment on the results: impact of mass precision, database size, and scoring function; importance of estimating p-values.
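
One simple way to attach a p-value to a score is to compare it with scores obtained against random sequences. The sketch below is an assumption about how this could be done, not the method of any program covered here: the names shuffled and empirical_p_value are hypothetical, and the null scores are expected to come from searching the same spectrum against shuffled versions of the database sequences.

```python
import random


def shuffled(sequence, rng):
    """Random sequence with the same amino acid composition."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def empirical_p_value(observed, null_scores):
    """Fraction of random scores at least as good as the observed one.

    The +1 terms keep the estimate away from zero when no random
    score reaches the observed score.
    """
    at_least = sum(1 for s in null_scores if s >= observed)
    return (1 + at_least) / (1 + len(null_scores))
```

The resolution of such an estimate is limited by the number of random scores: with 999 shuffled sequences the smallest reportable p-value is 0.001.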

Before we examine some important scoring functions, we describe the problem of database searching in more detail. Namely, we explain the extra difficulties caused by missed cleavages, variable modifications, and imperfect instrument calibration. For each problem we explain the classical solution: missed cleavages -> introduce them in the theoretical spectrum; variable modifications -> generate all possible peptide masses; calibration -> re-calibrate by using the theoretical masses.
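
The three classical solutions can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, methionine oxidation is used as the only variable modification, and the calibration is modeled as a straight line fitted by least squares on pairs of (experimental, theoretical) masses that have already been matched.

```python
OXIDATION = 15.99491  # monoisotopic mass shift of methionine oxidation


def with_missed_cleavages(peptides, max_missed):
    """Join up to max_missed + 1 consecutive fully cleaved peptides."""
    result = []
    last = len(peptides) - 1
    for i in range(len(peptides)):
        for j in range(i, min(i + max_missed, last) + 1):
            result.append("".join(peptides[i:j + 1]))
    return result


def variable_mod_masses(peptide, base_mass, residue="M", shift=OXIDATION):
    """All masses obtained by modifying 0..n of the n target residues.

    Only the number of modified residues changes the peptide mass,
    so their positions do not need to be enumerated.
    """
    n = peptide.count(residue)
    return [base_mass + k * shift for k in range(n + 1)]


def fit_calibration(pairs):
    """Least-squares line theoretical = a * experimental + b."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def recalibrate(masses, a, b):
    """Apply the fitted correction to a list of experimental masses."""
    return [a * m + b for m in masses]
```

Each option enlarges the theoretical spectrum, which increases the chance of random matches. This is one reason why the number of missed cleavages and variable modifications is usually kept small.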

We conclude Part 2 by detailing MOWSE, ProFound, MSA and OLAV-PMF scoring functions.

MOWSE is used by the commercial package MASCOT and it is based on a simple model of “typical” peptide mass probabilities, which is computed by averaging theoretical spectra of a large population of proteins.

ProFound is a commercial package that scores spectrum matches by using a Bayesian combinatorial model of “how many masses are expected to be found in the experimental spectrum”.

MSA is a successful heuristic model that explicitly uses TOF mass spectrometer properties to distinguish between spurious and correct PMF identifications. It was developed at the Max Planck Institute, Berlin.

OLAV-PMF is a scoring function that is based on a likelihood ratio of two statistical models (correct and random identification) that try to make use of several properties of correct identifications such as protein coverage, amino acid composition, and peptide modifications.
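
The idea of a likelihood ratio score can be illustrated with a deliberately simple toy model. The sketch below is not the OLAV-PMF model: it assumes, for illustration only, that each theoretical peptide mass is matched independently with probability p_correct when the identification is correct and with probability p_random when it is random. Both probabilities and the function name are hypothetical.

```python
import math


def log_likelihood_ratio(matched, total, p_correct=0.5, p_random=0.05):
    """Log of P(data | correct) / P(data | random) under a binomial model.

    matched: number of theoretical masses found in the spectrum
    total:   number of theoretical masses for the candidate protein
    The binomial coefficient is identical in both models and cancels.
    """
    unmatched = total - matched
    return (matched * math.log(p_correct / p_random)
            + unmatched * math.log((1 - p_correct) / (1 - p_random)))
```

A positive score favors the correct-identification model and a negative score favors the random model. A real scoring function would multiply in further likelihood ratios for properties such as protein coverage or peptide modifications.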

State the importance of statistical modeling.

Part 3: Peak detection

In this third part we rapidly detail the problem of peak detection as illustrated by MALDI-TOF spectra used for PMF.

Examples of spectra to present electronic and chemical noise, baseline, isotope peaks, and overlapping peptide signals.

Explain the main strategies for performing peak detection and present a very simple one as well as an advanced one. Give examples showing the practical difficulty of writing very good programs.
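
A very simple peak detector might look like the sketch below. It is an assumption about what "very simple" could mean, not the method presented in the tutorial: the baseline is estimated as the median intensity, the noise as the median absolute deviation, and a peak is any local maximum that exceeds the baseline by a chosen multiple of the noise. The function name and the snr parameter are hypothetical.

```python
from statistics import median


def detect_peaks(mz, intensity, snr=3.0):
    """Return (m/z, intensity) for local maxima above baseline + snr * noise."""
    baseline = median(intensity)
    # Median absolute deviation as a robust noise estimate;
    # the small floor avoids a zero threshold on flat signals.
    noise = median([abs(y - baseline) for y in intensity]) or 1e-12
    threshold = baseline + snr * noise
    peaks = []
    for i in range(1, len(intensity) - 1):
        y = intensity[i]
        if y > threshold and y > intensity[i - 1] and y >= intensity[i + 1]:
            peaks.append((mz[i], y))
    return peaks
```

This sketch ignores a sloping baseline, isotope peaks, and overlapping peptide signals, which are precisely the difficulties that make robust peak detection hard in practice.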

Part 4: Complex samples and tandem mass spectrometry

Database sizes and sample complexity may limit the usage of PMF. The solution is to get more information on the peptides so that we can identify them directly. Tandem mass spectrometry is a way to obtain such additional information via fragmentation. Explain the principle of fragmentation in a very general way.

Present a schematic abstract mass spectrometer with ion source, fragmentation cell and mass analyzer. Present different technologies (collision-induced fragmentation, post-/in-source decay).

Explain the benefit of having more information on peptides. One consequence is that peptide separation before mass spectrometry analysis is possible: LC-MS is introduced with electrospray ionization (ESI) and LC-MALDI.

Explain fragmentation more precisely and introduce the various types of ions. Each type of instrument and parameter setting yields certain ions only. The ions actually generated also depend on the properties of the peptide. These observations together are the basis for designing MS/MS scoring functions. Modified peptides yield modified MS/MS spectra.
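
As an illustration of the ion types, the sketch below computes the singly charged b and y ion series of a peptide, which are two of the most commonly observed series. It is a minimal example: the function name is hypothetical, only unmodified peptides are handled, and other ion types, neutral losses, and higher charge states are left out.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Return the singly charged b and y ion m/z lists (b1..b(n-1), y1..y(n-1))."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    # b ions contain the N-terminal residues plus one proton.
    b = [sum(masses[:i]) + PROTON for i in range(1, n)]
    # y ions contain the C-terminal residues plus water and one proton.
    y = [sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)]
    return b, y
```

A modification on a residue shifts every b or y ion that contains that residue by the modification mass, which is how modified peptides yield modified MS/MS spectra.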

Before we examine important scoring functions, we list typical experimental conditions and the type of data generated.

Scoring functions: MASCOT, SEQUEST, post-processing of SEQUEST, OMSSA, OLAV.

MASCOT, a commercial program, uses an adapted MOWSE score to score peptide identifications. MASCOT includes spectrum pre-processing to determine noise level.

SEQUEST, another commercial program, compares the experimental and theoretical spectrum by first generating an artificial spectrum from the theoretical mass lists and then by comparing the two spectra directly. A second step involves the computation of a heuristic score for the top-ranked peptides identified during the first step.
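
The idea of turning a theoretical mass list into an artificial spectrum and comparing it directly with the experimental one can be sketched with binned spectra and a dot product. This is a strong simplification and should not be read as the SEQUEST scoring scheme, whose cross-correlation and preprocessing are more elaborate. The function names, the unit intensity given to every theoretical fragment, and the default bin width of 1.0 are assumptions for the example.

```python
def to_bins(peaks, bin_width=1.0):
    """Sum the intensities of (m/z, intensity) peaks falling in the same bin."""
    bins = {}
    for mz, intensity in peaks:
        key = int(mz / bin_width)
        bins[key] = bins.get(key, 0.0) + intensity
    return bins


def artificial_spectrum(fragment_masses, bin_width=1.0):
    """Binned theoretical spectrum with unit intensity per fragment mass."""
    return to_bins([(m, 1.0) for m in fragment_masses], bin_width)


def dot_score(experimental_bins, theoretical_bins):
    """Dot product of the two binned spectra."""
    return sum(
        experimental_bins.get(key, 0.0) * value
        for key, value in theoretical_bins.items()
    )
```

With this representation, candidate peptides can be ranked by dot_score, and a second, more expensive score can then be computed for the top-ranked candidates only.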

Much effort has gone into designing filters for SEQUEST results in order to eliminate false positive identifications. The standard design of such post-processing tools will be presented.

OLAV is a scoring function that models various properties of correct versus random matches in the framework of likelihood ratios and naïve Bayesian classification. It is implemented in the commercial program Phenyx.

Reiterate the importance of statistical modeling and p-value computations.

Part 5: Other problems, other approaches

We list several problems which are of great importance in proteomics today. For each of them, we rapidly describe the analytical problem, the type of data and bioinformatics challenges. The notions introduced for database searching and proteomics in general allow the audience to easily understand the problems, their importance and the associated bioinformatics.

The problems we will present are:

-Genome not sequenced or limited database

Peptide de novo sequencing: directly infer the peptide sequence from the tandem mass spectrum (no database search) and use the predicted sequence per se or for homology searches in closely related organisms.
Homology searches: search against the sequences of closely related organisms.
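
The principle behind de novo sequencing can be shown on an idealized case: if a complete ladder of consecutive fragment masses is available, each mass difference corresponds to one amino acid residue. The sketch below assumes such a clean ladder, which real spectra rarely provide, and the function name and tolerance are hypothetical. Practical algorithms must cope with missing and spurious peaks.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def residues_from_ladder(ladder, tol=0.02):
    """Map each difference between consecutive ladder masses to residues.

    Ambiguous differences (for example L and I, which have the same
    mass) are reported as "L/I"; unexplained differences as "?".
    """
    sequence = []
    for low, high in zip(ladder, ladder[1:]):
        delta = high - low
        matches = [aa for aa, m in RESIDUE_MASS.items() if abs(m - delta) <= tol]
        sequence.append("/".join(matches) if matches else "?")
    return sequence
```

The ambiguity between leucine and isoleucine, and between residues of nearly equal mass at low mass precision, is one reason why predicted sequences are often used for homology searches rather than exact matching.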

-Genome annotation via proteomics

Gene predictions and annotation by homology have their limits and there is a demonstrated potential to gain extra information about genes and their structures by using experimental proteomics data. We will explain how this can be done on a large scale and give examples to demonstrate the potential of the approach.

-Protein characterization

Top-down analyses: find modification sites, typically phosphorylations, by analyzing the intact protein.
Identification of modified peptides. Database search programs can be used to find modified peptides in an iterative way.
Peptide de novo sequencing of modified peptides. Besides its ability to identify new peptides, peptide de novo sequencing can also find modified peptides.

-Quantification and differential proteomics

2D gels and image analysis. Mention challenging problems in 2D gel image processing and comparison.
Semi-quantitative methods: identification scores or the number of identified peptides can be used to infer semi-quantitative and relative information about protein concentration.
Stable isotope labeling methods make it possible to obtain precise quantitative information on protein concentrations and thus to compare samples.
Large individual variations call for adapted differential expression analysis methods.
Profiling: direct mass spectrometry measurements of samples make it possible to detect masses that differ reliably between sample classes.
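
For stable isotope labeling, the basic computation is a ratio between the signals of the light and heavy forms of each peptide, summarized per protein. The sketch below is a minimal illustration under assumptions: the function name is hypothetical, each peptide is given as a (light intensity, heavy intensity) pair, and the median is chosen as a summary because it is less sensitive to outlier peptides than the mean.

```python
from statistics import median


def protein_ratio(peptide_pairs):
    """Median heavy/light intensity ratio over the peptides of one protein.

    Pairs with a zero light intensity are skipped because their
    ratio is undefined. Returns None if no usable pair remains.
    """
    ratios = [heavy / light for light, heavy in peptide_pairs if light > 0]
    return median(ratios) if ratios else None
```

A ratio close to 1 suggests similar concentrations in the two samples. Deciding whether a deviation is significant requires the differential analysis methods mentioned above.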

-Protein structure prediction

Mass spectrometry may be used to assist protein structure prediction, for instance by detecting disulfide bonds.

-Proteomics data integration

From laboratory data to identification results via quantification: because sample preparation can be complex, it is especially important in proteomics to be able to make use of intermediate laboratory information such as chromatography profiles. Also, since proteins are elaborate products of a cascade of processes in the cell, it is often necessary to relate proteomics analysis results to known facts concerning gene structure (alternative splice forms), enzymes (degradation products) and the literature.
