Batch Searching and Analysis using Protein Prospector

Description

This document provides a summary of features in Protein Prospector specifically designed for the analysis of large numbers of MS/MS spectra submitted as a batch. The analysis of this type of data is split between two programs in Protein Prospector. First, 'Batch-Tag' searches the data against a database. The results of this search are then summarized in a second program, 'Search Compare'. A description of the workings and performance of these two programs was first published in 2005 (Chalkley, R.J. et al. Mol Cell Proteomics (2005) 4 8 1194-1204). An updated description, which covers Expectation value calculation, was published in 2008 (Chalkley, R.J. et al. Mol Cell Proteomics (2008) 7 12 2386-2398).

On the Protein Prospector home page there are links to two different Batch-Tag forms. Batch-Tag Web is used when the data needs to be uploaded to the server. If the data is already on the server then you should use the Batch-Tag form. If you are using the public web server you should initially use the Batch-Tag Web form. If you then want to re-search the data with different parameters you should use the Batch-Tag form.


Users

The results of all Batch-Tag searches are saved by Protein Prospector and can be viewed at any time using Search Compare. To prevent all users from having access to everyone else's results, the user must enter a username and password when using Batch-Tag and Search Compare. This allows the user to browse through all of their previous results, and also allows the comparison of search results.


Creating Users

When selecting Batch-Tag, Batch-Tag Web or Search Compare, a Login page will be brought up, where the user must enter their username. If you do not have a username, you will need to create one by clicking on "Add User". This will bring up a separate page where you must enter a username, password and e-mail address. Usernames may contain only lower-case letters and numbers.


Batch-Tag and Batch-Tag Web
Search Fields
Database

There are a number of protein and gene databases available for download from the web. However, they differ widely in terms of number of entries and redundancy. Of the databases available on the website, SwissProt is the smallest but best annotated, UniProt (the combination of SwissProt and TrEMBL) is significantly bigger but less well annotated, and NCBInr is the biggest but most poorly annotated. Concatenated databases are available that can be used for estimating false discovery rates in results.

Taxonomy

Species-limited searches in Protein Prospector programs are performed by pre-filtering database entries according to the user-designated species prior to searching. Combinations of species can also be searched (e.g. mammals). It is possible to select more than one option from the taxonomy list, provided that none of them is 'All'. In addition to the options in the taxonomy list, any taxonomy identifier from the NCBI taxonomy browser (e.g. Green Plants) can be entered in the 'Taxonomy Names' field in the 'Pre-search Parameters'.

Results Name

This is the name given to the search results when viewed in Search Compare.

Pre-Search Parameters

These are a set of parameters that allow you to filter the entries in a database that are searched. One can restrict the search by protein MW or pI. One can input a list of species codes to be searched, restrict to proteins with a certain word in their name, specify a list of accession numbers to search or add additional accession numbers from proteins that would not be considered based on other filtering parameters (e.g. you may want to add pig trypsin as a possible match to a search of bacterial proteins).

Precursor Charge Range

Some peaklist generation software does not assign a charge state to precursor ions. If no charge is listed, the data will be searched using the entire list of charge states selected in the Precursor Charge Range field.

Parent Tol

The mass tolerance for the precursor ion can be expressed in parts per million (ppm), Daltons, mmu (millimass units) or % of mass.
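
As a rough illustration of how these units relate, the sketch below converts a tolerance into Daltons for a given mass. The function name `tolerance_in_da` and the unit strings are hypothetical and are not part of Protein Prospector; only the conversions themselves are standard (1 mmu = 0.001 Da, 1 ppm = one millionth of the mass).

```python
def tolerance_in_da(value, unit, mass):
    """Convert a mass tolerance to Daltons for a peptide of the given mass.

    The unit strings here are illustrative, not Protein Prospector's own.
    """
    if unit == "Da":
        return value
    if unit == "mmu":            # 1 millimass unit = 0.001 Da
        return value * 1e-3
    if unit == "ppm":            # parts per million of the mass
        return mass * value * 1e-6
    if unit == "%":              # percent of the mass
        return mass * value / 100.0
    raise ValueError(f"unknown unit: {unit}")


# A 20 ppm tolerance on a 2000 Da precursor is a window of about +/-0.04 Da.
print(tolerance_in_da(20, "ppm", 2000.0))
```

Note that ppm and % tolerances scale with mass, whereas Da and mmu tolerances are fixed.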

Sys Err

Sometimes data contain a systematic mass error because of poor calibration. For example, all precursor masses may have errors between +60 and +100 ppm. In this situation, searching the data with an 80 ppm systematic error and a parent tolerance of 20 ppm would give more reliable results than widening the tolerance to 100 ppm, without losing any correct answers. Peptide Results Reports display a histogram of mass errors if there are over 50 peptides matched and will report a mean systematic error. The units of systematic error will be the same as for the Parent Tol.
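
The effect of the systematic error setting can be sketched as a tolerance window centred on the offset rather than on zero. This is an illustration only: the function `within_tolerance` is hypothetical, and it assumes a symmetric window, which the text above implies but does not state.

```python
def within_tolerance(error_ppm, sys_err_ppm, parent_tol_ppm):
    """True if an observed precursor error falls inside the search window.

    Assumes the window is symmetric around the systematic error.
    """
    return abs(error_ppm - sys_err_ppm) <= parent_tol_ppm


# Errors of +60 to +100 ppm all fit an 80 ppm offset with a 20 ppm tolerance;
# an error of +10 ppm does not.
for err in (60, 80, 100, 10):
    print(err, within_tolerance(err, sys_err_ppm=80, parent_tol_ppm=20))
```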

Frag Tol

The mass tolerance for the fragment ions can be expressed in parts per million (ppm), Daltons, mmu (millimass units) or % of mass.

Instrument (Batch-Tag Web)

This parameter defines the ion types that are searched for when matching the MS/MS fragment masses (see table). The ion types searched for are the same whether it is a MALDI or ESI instrument, but different weightings for ion types are used in the scoring. For those options defined as ‘low res’ the search engine assumes an inability to determine fragment ion charge state, and will consider an ion as potentially being singly- or doubly-charged.

Instrument options (table columns): Q-TOF, TOF_TOF, ION-TRAP, FT-ICR-ECD, FT-ICR-CID, ETD
Ion types (table rows): a; b and y; c and z; c-1 and z+1; a loss; b loss; y loss; Internal; Immonium; Internal loss; d, v, w

Digest

This defines the cleavage specificity assumed when searching the database for MS/MS parent masses. Some combinations of enzymes are available.

Non-specific

It is possible to search for non-specific cleavage at either one of the peptide termini or one can relax the specificity at a specific terminus. Note this will dramatically increase search times, so it is recommended to only use this on a small database or for searching against a list of accession numbers.

Max # of missed cleavages

This defines the maximum number of missed enzyme cleavage sites present in a peptide for it to be considered in the database search.
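
To show what a missed cleavage limit means in practice, here is a minimal sketch of an in-silico digest. It is not Protein Prospector's digestion code: the function `tryptic_peptides` is hypothetical, and it assumes the common simplified trypsin rule (cleave after K or R unless the next residue is P).

```python
def tryptic_peptides(sequence, max_missed=1):
    """List peptides from a simple trypsin rule with up to max_missed missed sites.

    Rule assumed here: cleave after K or R unless the next residue is P.
    """
    # Positions just after each cleavage site, plus both ends of the protein.
    cuts = [0]
    for i, residue in enumerate(sequence[:-1]):
        if residue in "KR" and sequence[i + 1] != "P":
            cuts.append(i + 1)
    cuts.append(len(sequence))

    peptides = []
    for start in range(len(cuts) - 1):
        # Joining (missed + 1) adjacent fragments leaves `missed` sites uncleaved.
        for missed in range(max_missed + 1):
            end = start + 1 + missed
            if end < len(cuts):
                peptides.append(sequence[cuts[start]:cuts[end]])
    return peptides


print(tryptic_peptides("AAKGGRCC", max_missed=1))
# prints ['AAK', 'AAKGGR', 'GGR', 'GGRCC', 'CC']
```

Each extra missed cleavage allowed adds more candidate peptides per protein, which is why raising this value lengthens the search.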

Constant Modifications

Modifications selected here will be assumed to be always present.

Expectation Calc Method

This option allows you to turn off the expectation value calculation. For searches of very small databases (especially searches by accession number) the expectation values reported become inaccurate. The expectation value calculation involves searching a randomized database before the standard database search, so turning it off can also speed up searches.

Variable Mods

Modifications selected here will be searched both with the modification present and with it absent. To select multiple modifications hold down the 'Ctrl' key ('⌘' on a Macintosh) and click the modifications you would like to add. Similarly, to deselect modifications, hold down 'Ctrl' or '⌘' and click on the modification name.

Max. variable modifications per peptide

Restricting the maximum number of variable modifications per peptide can significantly speed up database searches.
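
The speed-up comes from the smaller number of modified forms that have to be considered for each peptide. The sketch below counts them under a simplifying assumption that is mine, not the documentation's: each candidate site is either unmodified or carries a single modification. The function `modified_forms` is hypothetical.

```python
from math import comb


def modified_forms(n_sites, max_mods):
    """Count modification states for n_sites candidate sites.

    Assumes each site is either unmodified or carries one modification.
    """
    upper = min(n_sites, max_mods)
    return sum(comb(n_sites, k) for k in range(upper + 1))


# Six candidate sites: 64 forms with no cap, but only 22 with a cap of two.
print(modified_forms(6, 6), modified_forms(6, 2))
```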

Mass Modifications

N.B. This is a function that should only be used when searching a very restricted list of proteins. We recommend only using this on a list of accession numbers of proteins already identified in the sample from an initial search.

It is possible to search for unanticipated mass modifications on any or all amino acids. A mass range for modifications is specified. A given mass modification is unlikely to be an exact integer change, so the Defect option defines an adjustment to the nominal mass shift that allows the user to still search with reasonable mass tolerance restrictions (although we recommend employing less restrictive parent and fragment tolerance restrictions than normal when analyzing data that has high mass accuracy, such as TOF, FT or Orbitrap data). The neutral loss option will look for modifications that are immediately lost upon fragmentation; i.e. it will assume there is no modification when matching fragment ions. This can be useful for identifying labile modifications on peptides such as O-GlcNAcylation or sulfation. For neutral loss modifications, no modification is reported on the peptide sequence, but is indicated by the mass error on the precursor mass.

If the Uncleaved checkbox is selected then mass modifications are not considered at digest cleavage sites.

Mass modifications to the N and C terminus of a peptide can be restricted to peptides at the N or C terminus of the protein by using the appropriate menu selection.

Expectation values reported for this type of search are likely to be inaccurate. Hence, we recommend performing this type of search against a concatenated database, so that it is possible to determine a suitable acceptance threshold.

Matrix Modifications

These are combinations of modifications that can be searched at once, such as any amino acid substitution (Homology) or amino acid substitutions that would result from single base changes in the genome.

Upload Data from File

Batch-Tag accepts peak list data in the form of lists containing m/z, intensity and charge. It will accept peak lists in mgf, dta, mzXML and mzData file formats. There is no need to specify the file format; Protein Prospector will try to recognize it automatically. Compressed files in zip or gz format may also be uploaded. These compressed files may contain many peak list files, allowing the searching of multiple files in one search.

The compressed file can also contain raw data from Xcalibur (Thermo), Analyst (ABI) or 4700 Explorer (ABI), if you want to do quantitation analysis. In that case the raw data file and the corresponding peak list file must have the same name (apart from the filetype suffix), and the files must be combined into a single zip or gz file for upload through Batch-Tag Web.

Protein Prospector is able to determine charge states and de-isotope fragment ions in peak lists. This can be very beneficial when the data is of sufficient resolution to reliably determine the charge state; e.g. TOF or FT-ICR data. It means that instead of creating an imaginary peak type that has half the m/z of expected fragment ions to look for doubly-charged peaks, it can directly identify the multiply charged peaks. This significantly reduces the chances of random peak matches. Hence, we recommend submitting peak lists that have not been de-isotoped, because if the isotopes have been removed it is impossible to determine the charge state. Of course, for low resolution data, such as ion trap data, charge state determination is usually not possible, so the ion type at half m/z has to be created to try to find doubly-charged fragment ions.
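
The reason isotopes are needed for charge determination can be shown with two standard mass spectrometry relationships. This is a sketch, not Protein Prospector's de-isotoping code; the function names and constants below are my own. Adjacent isotope peaks of an ion with charge z are about 1.00335/z apart in m/z, and the doubly-charged form of a fragment sits at roughly half the m/z of the singly-charged form.

```python
ISOTOPE_SPACING = 1.00335   # approximate 13C - 12C mass difference, Da
PROTON = 1.007276           # proton mass, Da


def charge_from_isotope_spacing(mz_spacing):
    """Estimate charge from the m/z gap between adjacent isotope peaks."""
    return round(ISOTOPE_SPACING / mz_spacing)


def doubly_charged_mz(singly_charged_mz):
    """m/z of the 2+ form of a fragment whose 1+ form has the given m/z."""
    return (singly_charged_mz + PROTON) / 2.0


print(charge_from_isotope_spacing(0.5017))   # isotope peaks ~0.5 apart
print(doubly_charged_mz(1000.0))
```

If a peak list has already been de-isotoped, the spacing is gone and only the second approach, guessing at a doubly-charged position, remains.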

If your peak lists are in a format that Batch-Tag currently does not accept, then e-mail us a small sample file and we will try to add support for your file format.

Search Job Management

When the search is executed a job status page will be brought up that will report the progress of the search. For a new project Protein Prospector will perform two searches: one for calculating score distributions for expectation value calculation and one for identifying the peptides. Hence, when the progress reaches 100% it will start again at 0% and the search will finish when it reaches 100% for a second time. Subsequent analyses of a dataset in the same project only need a single search, provided you do not change any parameters that would alter the score distribution.

A search daemon manages search submissions. If more searches are submitted than can be performed at one time they will be lined up and as soon as a search is completed the next one is started. Searches are performed in the order they are submitted. Searches continue and complete independently of whether the web browser is still open.


Search Compare
Report Options

This section defines the report type and how it is presented.

Format

Results can either be displayed in HTML format or in a tab-delimited format. The tab-delimited format can be easily copied into a spreadsheet.

Accession Numbers

It is possible to filter the results to only view results for proteins on a list of accession numbers. Conversely, one can remove selected accessions from the list by checking the 'Remove' button.

Multi Sample

This should be highlighted when the peak list searched includes data from several different samples; e.g. MS/MS spectra acquired on a TOF-TOF from several unrelated spots. This will produce a separate summary for each spot rather than combining the results from all spectra. Similarly, if multiple peaklists were submitted for one search, this will split the results by peaklist. It is also possible to filter the results to only show results from one spot or one peaklist using the Spot/Fraction field.

Preferred Species

Often when you do a database search some of your hits will not be unique to a particular species, and the species entry that is displayed can be somewhat random. The preferred species option allows you to enter strings from either the NCBI taxonomy browser or the SwissProt controlled species vocabulary to describe the mix of species in your sample. The software will then find the nearest match in the taxonomy tree from the matches that you have, as long as the matches are equivalent. For example, if your sample was a mixture of yeast, E. coli and human proteins you could enter:

YEAST
ECOLI
HUMAN

If a peptide matched both a YEAST and an ECOLI protein then the YEAST one would be preferentially displayed.
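
The YEAST/ECOLI example can be sketched as picking the candidate whose species appears earliest in the preferred list. This is a deliberate simplification: it treats the list order as a priority and ignores the taxonomy-tree matching described above. The function `pick_displayed_entry` and the accession strings are made up for illustration.

```python
def pick_displayed_entry(candidates, preferred_species):
    """Choose which of several equivalent protein matches to display.

    candidates: list of (accession, species_code) tuples.
    Earlier entries in preferred_species win; unlisted species come last.
    """
    def priority(candidate):
        species = candidate[1]
        if species in preferred_species:
            return preferred_species.index(species)
        return len(preferred_species)

    # min() keeps the first candidate when priorities tie.
    return min(candidates, key=priority)


preferred = ["YEAST", "ECOLI", "HUMAN"]
matches = [("ACC_1", "ECOLI"), ("ACC_2", "YEAST")]   # made-up accessions
print(pick_displayed_entry(matches, preferred))
# prints ('ACC_2', 'YEAST')
```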

Min Best Discriminant Score

The discriminant score combines two measures of the search result. One is the expectation value for the peptide match (a measure of the likelihood that a match is random) and the other is a 'best peptide score', which takes into account the fact that if a protein has been confidently identified in a sample, it is more likely that other peptides will be identified from the same protein. As a default this threshold is set to 0 and, for a normal database search, matches with discriminant scores below 0 are generally incorrect, whilst those above 0 are mostly correct. At the top of the search results is a plot of all the discriminant scores. This should contain two distributions: one for incorrect (random) matches and one for correct.

Min Prot Score and Min Pep Score

Batch-Tag uses a simple scoring scheme based on a certain score for each ion type. These two parameters set a minimum quality standard for a spectrum to be accepted. The minimum protein score is typically set higher than the minimum peptide score to require a higher standard if a protein is going to be identified on the basis of a single peptide.

Max EValue Protein and Peptide

An expectation value is a measure of how many times an event is expected to happen at random. An expectation value of 0.1 means that if the search was repeated ten times you would expect one random match.

For a protein to be reported, it has to have a peptide with an expectation value less than the Max EValue Protein threshold. Other peptides from the reported proteins are reported as long as their expectation values are less than the Max EValue peptide threshold.
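
The two thresholds work together as a two-stage filter, which the following sketch tries to capture. The function `filter_by_evalue`, the data layout and the example peptides are all hypothetical; Search Compare's actual implementation may differ in detail.

```python
def filter_by_evalue(peptides_by_protein, max_e_protein, max_e_peptide):
    """Apply the protein and peptide expectation value thresholds.

    peptides_by_protein: dict mapping accession -> list of (peptide, e_value).
    """
    report = {}
    for accession, peptides in peptides_by_protein.items():
        # A protein needs at least one peptide under the protein threshold.
        if not any(e < max_e_protein for _, e in peptides):
            continue
        # Its remaining peptides only need to pass the peptide threshold.
        report[accession] = [(p, e) for p, e in peptides if e < max_e_peptide]
    return report


results = {   # made-up example data
    "P1": [("PEPTIDEK", 0.001), ("SAMPLER", 0.04), ("NOISEK", 0.5)],
    "P2": [("WEAKR", 0.03)],
}
print(filter_by_evalue(results, max_e_protein=0.01, max_e_peptide=0.05))
```

Here P2 is dropped because none of its peptides passes the stricter protein threshold, while the weaker SAMPLER match is kept because P1 has already qualified.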

Best Peptide Only/Best Per Charge/Keep Replicate Peps

If Best Peptide Only is selected Search Compare will only report the best match if the same peptide has been matched multiple times. The Best Per Charge option will retain the best match for each charge state. Keep Replicate Peps will report redundant peptide identifications.
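
The three options differ only in the key used to group repeated identifications, as this sketch shows. The function `reduce_replicates`, the mode labels and the use of a single `score` field to decide which match is "best" are assumptions made for illustration.

```python
def reduce_replicates(matches, mode):
    """Collapse repeated identifications of the same peptide.

    matches: list of dicts with 'peptide', 'charge' and 'score' keys.
    mode: 'best_peptide_only', 'best_per_charge' or 'keep_replicates'
          (labels chosen for this sketch).
    """
    if mode == "keep_replicates":
        return list(matches)

    best = {}
    for match in matches:
        if mode == "best_peptide_only":
            key = match["peptide"]
        elif mode == "best_per_charge":
            key = (match["peptide"], match["charge"])
        else:
            raise ValueError(f"unknown mode: {mode}")
        # Keep the highest-scoring match seen so far for this key.
        if key not in best or match["score"] > best[key]["score"]:
            best[key] = match
    return list(best.values())


hits = [   # made-up example data
    {"peptide": "PEPTIDEK", "charge": 2, "score": 30.0},
    {"peptide": "PEPTIDEK", "charge": 3, "score": 25.0},
    {"peptide": "PEPTIDEK", "charge": 2, "score": 20.0},
]
for mode in ("best_peptide_only", "best_per_charge", "keep_replicates"):
    print(mode, len(reduce_replicates(hits, mode)))
```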

Best Discr Only

For each spectrum the top five scoring matches are saved, but only the one with the best Discriminant score is reported. By deselecting this parameter Search Compare can report results other than the top Discriminant scoring match; i.e. it could potentially report more than one peptide to a given spectrum.

Discr Score Graph

If you select this parameter a graph of the discriminant scores is plotted near the top of the report. This can be useful for selecting an appropriate value for the min best discriminant score parameter. Generally you can leave this selected. However if you are doing a very large quantitation analysis (say a hundred fractions or more) you can significantly reduce the amount of memory used by the program by deselecting this. Thus you could do a report without quantitation analysis to establish the discriminant score limit and then deselect this option before doing the quantitation analysis.

Peptide Composition

The peptide composition option is used along with the composition checkbox to display a column in either the peptide or time report. This column will contain a 1 if the peptide matches the composition options or a 0 if it doesn't. The composition options include amino acids, modifications or mass modifications. Mass modifications should be entered as integers (one per line). If multiple composition options are selected then AND means all the selections have to be present and OR means just one of the selections has to be present. This feature is particularly useful for sorting results in a spreadsheet; e.g. grouping all phosphorylated peptides together.
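
The AND/OR behaviour of the composition column can be sketched as follows. The function `composition_flag` is hypothetical, as is the convention of treating single letters as amino acids and longer strings as modification names; mass modifications are left out to keep the example short.

```python
def composition_flag(peptide, mods, required, combine="AND"):
    """Return 1 if the peptide matches the composition options, else 0.

    peptide: amino acid sequence; mods: set of modification names on it.
    required: single-letter amino acids and/or modification names.
    """
    present = []
    for item in required:
        if len(item) == 1:
            present.append(item in peptide)   # amino acid letter
        else:
            present.append(item in mods)      # modification name
    matched = all(present) if combine == "AND" else any(present)
    return 1 if matched else 0


print(composition_flag("SAMPLER", {"Phospho"}, ["S", "Phospho"], "AND"))
print(composition_flag("SAMPLER", set(), ["S", "Phospho"], "AND"))
print(composition_flag("SAMPLER", set(), ["S", "Phospho"], "OR"))
```

Sorting a spreadsheet on this 0/1 column then groups the matching peptides together.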

Report Type

There are three options: a protein-level report, a peptide-level report or a time report. The time report lists every MS/MS spectrum acquired in order of acquisition, with the match (if there was one). Those spectra for which there was no match will just report the number of peaks submitted.

Report Homologous Proteins

This defines how to deal with homologous proteins. The default is 'interesting' which means a homologous protein will only be reported if there is at least one unique peptide matching to the protein. Occasionally proteins will be reported as homologous when the level of homology may be fairly low (e.g. only two out of ten peptides are identical between proteins).

Report Hits Type

Separated/Merged

The Separated/Merged menu is shown if multiple search results are selected. If you select Separated then the results for each search have their own set of columns in the report. If you select Merged then the results are combined together into a single set of columns. This is useful, for example, for combining CID and ETD results into a single report.

Sort Type

Defines the parameter that you want to use to order the results. The default is 'Discriminant Score' which we believe is the best way to order the results in terms of reliability. Peptide lists can also be sorted by score, start residue in the protein sequence, elution time/spot number or m/z.

Display Unprocessed MS/MS

By default, MS-Product presentations of peptide assignments display only the processed peaklist that was used in the database search to make the peptide assignment; i.e. the de-isotoped peaklist with probably only 40 peaks (see Max MSMS Peaks below). Selecting Display Unprocessed MS/MS tells MS-Product to display the raw peaklist as it was submitted.

Max MSMS Peaks

There is a great variety in the quality of peak picking by software when producing peaklists to be submitted for database searching, with some peaklists being essentially an unprocessed list of hundreds of masses, most of which are noise. Protein Prospector takes the mass range of the peaks submitted, splits it in half, then for most instruments takes only the top 20 peaks in each half of the spectrum to produce a list of 40 peak masses that are used for database searching. 40 peaks has been found to be in most cases the optimal number for tryptic peptides to give specific answers without introducing too many false matches to noise peaks. However, in some cases there may be a benefit to increasing or decreasing this number. For example, if peptides are generally long (maybe an enzyme was used that cuts infrequently) then there will be more real fragments in the spectrum, so increasing this value may be beneficial. Also, if you are analyzing post-translationally modified peptides and want to identify sites of modification, then allowing for more peaks that may assist in site assignment may be important. The maximum number of peaks considered is changed using the Max MSMS Peaks parameter.
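
The peak selection described above can be sketched as follows. The function `select_peaks` is hypothetical, and details such as how the midpoint peak and intensity ties are handled are my assumptions rather than Protein Prospector's documented behaviour.

```python
def select_peaks(peaks, max_peaks=40):
    """Keep the most intense peaks from each half of the m/z range.

    peaks: list of (mz, intensity) tuples. Half of max_peaks is taken
    from the lower half of the mass range and half from the upper half.
    """
    if len(peaks) <= max_peaks:
        return sorted(peaks)

    lowest = min(mz for mz, _ in peaks)
    highest = max(mz for mz, _ in peaks)
    midpoint = (lowest + highest) / 2.0
    per_half = max_peaks // 2

    low_half = [p for p in peaks if p[0] <= midpoint]
    high_half = [p for p in peaks if p[0] > midpoint]

    kept = []
    for half in (low_half, high_half):
        # Sort by intensity, highest first, and keep the top per_half peaks.
        kept.extend(sorted(half, key=lambda p: p[1], reverse=True)[:per_half])
    return sorted(kept)


# 100 synthetic peaks whose intensity rises with m/z.
peaks = [(100.0 + i, float(i)) for i in range(100)]
print(len(select_peaks(peaks)))
```

Splitting the range before ranking stops a few very intense peaks in one region from crowding out every fragment in the other half of the spectrum.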

Columns to Display

This section defines what columns you want displayed in your results summary. Some of these columns are specific to protein reports and some to peptide reports. Most of the column names are self-explanatory, but below is a definition of some of the less obvious.

Protein Rank

Protein rank within the results. Proteins that are homologous to another (interesting: see above) are given the same rank; e.g. 5-2 means that this protein is homologous to protein hit 5, but is the second best match within this set of proteins.

Protein Score

Sum of Batch Tag peptide scores.

Number Unique

Number of different peptides matched to a given protein. Multiple matches of the same peptide at the same or different charge states are defined as a single unique match, but modified versions of the same peptide (e.g. methionine oxidized version) count as separate unique matches.

Peptide Count

This is the number of peptides reported for a given protein whether unique or not. This is only different from the number of unique peptides if you have the Keep Replicate Peptides option selected. Some basic quantitation methods make use of peptide counts.

Best Peptide Score

Batch Tag score of the highest scoring peptide matched to a given protein. This is one of the parameters used for the discriminant scoring.

Best Discriminant Score

Discriminant score of the highest-scoring peptide matched to a given protein.

Peptide, DB Peptide and Mods

If you select the peptide column then the matching peptide is displayed with the modifications in the sequence. If you click on the sequence then an MS-Product report of the match will be displayed. The DB Peptide and Mods columns can be used to display the database peptide and modifications in separate columns which can be useful for comparing the match with that from other software packages.

Error

Error between observed and theoretical precursor mass. Units will be the same as those used for the search; i.e. will be parts per million if the Parent MS/MS tolerance was defined in ppm.
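
For reference, a ppm error is conventionally calculated relative to the theoretical mass. The helper `mass_error_ppm` below is a hypothetical illustration of that standard formula, not a Protein Prospector function.

```python
def mass_error_ppm(observed, theoretical):
    """Precursor mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


# An observed mass 0.02 Da above a theoretical mass of 1000 Da is about +20 ppm.
print(mass_error_ppm(1000.02, 1000.0))
```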

Number of Unmatched Peaks

Number of submitted MS/MS fragment peak masses that are not explained by the peptide match.

Peptide Rank

Rank of this match for the particular spectrum (top, second best, etc.) before discriminant scoring was applied.

Score

Score assigned by Batch Tag where points are given for every peak matched to a theoretical fragment ion from a peptide. Different numbers of points are given depending on the ion type and the instrument. For a more complete explanation see Chalkley, R.J. et al.

Score Difference

Difference in Batch Tag score between this match and the sixth best match to this spectrum (the sixth match is assumed to be an incorrect, random match). This is a crude measure of how much better than random a given match is.

Discriminant Score

More reliable scoring system than the Batch Tag score, this is calculated by combining 'Best Peptide Score' and Expectation value.

# in DB

Reports how many protein entries in the database contain this same peptide sequence (allowing for substitutions of Leu and Ile).
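
Leu and Ile have identical masses, so they are treated as interchangeable when counting. One simple way to sketch that is to replace every I with L before comparing. The function `count_proteins_containing` and the toy database are hypothetical.

```python
def count_proteins_containing(peptide, database):
    """Count database entries containing the peptide, treating I and L as equal.

    database: dict mapping accession -> protein sequence.
    """
    def normalise(sequence):
        return sequence.replace("I", "L")

    target = normalise(peptide)
    return sum(1 for seq in database.values() if target in normalise(seq))


toy_db = {   # made-up sequences
    "ACC_1": "MKTAILGGR",
    "ACC_2": "MKTALLGGR",
    "ACC_3": "MKTAVVGGR",
}
print(count_proteins_containing("TAILGGR", toy_db))
# prints 2
```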

Mass Mod

If you have selected mass modifications in your search, this column will contain the integer mass modification assigned to the peptide. A mass modification histogram will also be displayed near the top of the report.

Time

Spectrum number: This will report four columns: Fraction = peak list file the peptide was identified from (a somewhat redundant option in the current public version, as only one file can be searched at once); RT = LC retention time / Spot number; R = run number (only relevant to TOF-TOF data); # = MS/MS acquisition number on this spot/retention time. For LC-MS/MS data, if the software that created the peak list created duplicate peak lists for a given spectrum with different charge states, then this will be reflected in the # column.

Protein Length

The number of amino acids in the protein.

Links

Deselecting this disables all links within the report (e.g. clicking on a protein in the protein summary to get a peptide summary for that protein). This significantly increases the speed at which the page is displayed.

Checkboxes

If you select checkboxes then the report will contain a column of checkboxes which can be used to select or eliminate results from the report. For example you might want to manually look at the quantitation results and eliminate obvious outliers. Another possibility would be to just include peptides in the report with particular modifications. For example you could just include phosphorylated peptides in the report. A small form is displayed at the top of the report to facilitate this.

Raw Data / Quantitation

If the user has uploaded raw data files from one of the supported data types (Thermo .raw; ABI .wiff or ABI .t2d) along with the peaklists, then Search Compare can read the raw data to extract quantitative information. For details about how to upload raw data for quantitation see Upload Data from File above. If a ‘Raw Type’ of MS Precursor is selected it will report information about the precursor peak, whereas if a ‘Raw Type’ of Quantitation is selected it will calculate quantitation assuming isotopic labeling with the option selected in the ‘Quantitation’ pull-down menu.

Search Compare can calculate protein-level median, interquartile range (IQR) (the values that bracket the middle 50% of measurements), mean, +/- any number of standard deviations from the mean and how many peptides were used for quantitation (num). Intensities or peak areas can be used, and a minimum intensity or area threshold may be applied for a peak to be used for quantitation. For MS-based quantitation it is possible to average together scans over a period around the time a peak was selected for MSMS, which will give better ion statistics and accuracy. It is necessary to specify an approximate resolution of the data for the purposes of peak detection. For isotopic labeling strategies, if it is known that there was not full incorporation of the isotope into the heavy reagent, this can be compensated for.
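
The protein-level statistics named above can be reproduced for a list of peptide ratios with Python's standard library. This is a sketch under assumptions: the function `summarise_ratios` is hypothetical, and the quartile convention is Python's default for `statistics.quantiles`, which may not match the one Search Compare uses.

```python
import statistics


def summarise_ratios(ratios, n_sd=1.0):
    """Protein-level summary of peptide ratios (needs at least two values)."""
    q1, median, q3 = statistics.quantiles(ratios, n=4)
    mean = statistics.mean(ratios)
    sd = statistics.stdev(ratios)
    return {
        "median": median,
        "iqr": (q1, q3),                       # brackets the middle 50%
        "mean": mean,
        "mean_range": (mean - n_sd * sd, mean + n_sd * sd),
        "num": len(ratios),
    }


print(summarise_ratios([1.0, 2.0, 3.0, 4.0, 5.0]))
```

The median and IQR are less sensitive to a single outlying peptide ratio than the mean and standard deviation.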

Results Reports

At the top of each search result there will be a histogram plot of discriminant scores. Within this plot there should be two distributions, one for the correct answers and one for the incorrect answers. The distribution for the incorrect answers will always be much larger than the correct because for each spectrum Batch Tag is saving the top five results, so even if a correct answer is determined for every spectrum the incorrect distribution will be four times bigger.

Protein Report Links

After the discriminant score histogram there is a button 'Batch Tag of Listed Accession Numbers'. This allows you to do a subsequent database search of only the proteins that were listed in this report. Clicking on this will launch a new Batch Tag page where the accession numbers of the proteins identified in the first search are input into a new search page. This is a good approach for looking for post-translational modifications or peptides formed by enzyme non-specific cleavages of proteins that you have already decided are present.

If a search was performed against a concatenated database, it will report the number of proteins identified to the target database and separately the number matched to decoy sequences. Matches to decoy proteins are displayed with a negative accession number.
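
The target and decoy counts can be turned into a false discovery rate estimate. The sketch below uses one common simple convention (decoy hits divided by target hits); other conventions exist, and this is not necessarily the calculation Search Compare performs. The function `estimate_fdr` is hypothetical and relies on the negative accession number convention described above.

```python
def estimate_fdr(accessions):
    """Estimate FDR from a target/decoy search as decoy hits / target hits.

    Decoy matches are recognised by a leading '-' on the accession number.
    """
    decoys = sum(1 for acc in accessions if str(acc).startswith("-"))
    targets = len(accessions) - decoys
    return decoys / targets if targets else 0.0


# Four target proteins and one decoy give an estimated FDR of 0.25.
print(estimate_fdr(["P1", "P2", "P3", "P4", "-7"]))
```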

Clicking on a number in the Rank column will give a peptide report for the protein selected, including a plot of the sequence coverage observed. Clicking on the protein accession number will link to the relevant protein database entry.

Peptide Report Links

In the peptide report, after the histogram of discriminant scores, if there are more than 50 peptide identifications then there is a second histogram that plots the mass errors for all peptide matches above the minimum discriminant score. It also reports the mean error and standard deviation. (Note: for a TOF-TOF Multi Sample Report this will be the mean across all spots.) The mean error can be used as the 'systematic error' when re-searching data to improve the mass accuracy.

Clicking on the hit number will link to a peptide report including a sequence coverage map. Clicking on the peptide sequence will link to an MS-Product type report, which will display the match to the fragment ions. This report will give a visual representation of the peak matches: red peaks were matched, black peaks were not. It is possible to zoom in and out in this panel. Only ion types that are appropriate to the instrument type will be used. As a default only a, b, c, y and z ions are labeled, but clicking on 'loss' will label water and ammonia loss ions, clicking on 'imm' will label immonium ions, and clicking on 'Int' will label internal ions. Clicking on 'mass' will label the masses of those peaks not matched. Underneath this is a table listing all the peaks submitted and their corresponding matches with errors. Next will be a list of theoretical fragment ions from the peptide, where fragments that were observed are in red. Finally, there is a table that lists all potential fragment ions in order of increasing mass. If a mass modification has been assigned to the peptide, then clicking on the mass in the sequence at the top of the page will link to Unimod, where it will display known modifications with nominally the same mass.

Clicking on a value in the RT column in the Peptide Report will open a link to MS-Tag From File that allows you to search this one spectrum.

For quantitative studies, a boxplot is displayed for each protein, visualizing the quantitation results. A mean and the interquartile range are displayed. Datapoints from peptides that are unique to the protein reported are in red, while peptides that are shared among database entries are in blue. In the peptide report, for ratios where one of the datapoints is below the threshold, the ratio will be reported with a greater than or less than sign to indicate the lack of accuracy of the measurement. If a particular peak is not present at all the ratio will be reported as ‘high’ or ‘low’. In quantitation reports, when clicking on the precursor m/z this will link to the raw data that was used for the quantitation measurement.


MS-Tag from File

This is a version of MS-Tag that can search a single spectrum that is part of a set of peak lists in a file. This allows the user to see the other matches to a particular spectrum based solely on the Batch-Tag scoring system. Also, as only a single file is being searched it is possible to use looser search parameters (e.g. allow for multiple modifications, more missed cleavages, wider mass tolerance) that could be prohibitively slow when searching a whole dataset.

When a peak list is submitted to Batch-Tag if there are more than a certain number of peaks in the list (the default threshold is 40), then Prospector splits the mass range into two and takes the 20 most intense peaks in each half of the spectrum as the peak list for searching. In MS-Tag from File you can change this number. This might be useful, for example, if you think a peptide may be phosphorylated and you want to look for low intensity phosphorylated fragment ions.


Results Management

Batch-Tag search results are stored in a database, so previous search results can be viewed and re-interpreted at a later date. This also allows comparison of search results against each other. Results are stored by user, so you will only be able to see your own results. Searches of one set of data are grouped together into projects, then within a project the data can be searched with different parameters to create multiple result sets from one dataset. Using the program 'Results Management' it is possible to delete unwanted results files or whole projects.


Search Table

The Search Table displays the number of searches currently taking place, and the stage/progress of each of these searches. For their own searches, users will also see the search names and a link to the progress page for each individual search.