Friday, April 26, 2013

Common pitfalls of FDR estimation, part three

The third pitfall is also caused by an over-emphasis on sensitivity.

Another trend in database search software is to re-score the peptide identification results by using machine learning. The idea is straightforward: after the search, we know which hits are decoys. The algorithm takes advantage of this and retrains the parameters of the scoring function to get rid of the decoy hits. In doing so, it also gets rid of many of the false target hits.

The method is valid, except that it may cause FDR underestimation. This is because the false target hits are unknown to the machine learning algorithm, so there is a risk that it removes more decoy hits than false target hits. The remaining decoy hits then undercount the remaining false target hits.
This overfitting risk is well known in machine learning. A machine learning expert can reduce the risk but can never eliminate it.

The solution to pitfall number three is trickier.

The first suggestion: don’t use it. The philosophy here is that judges cannot be players. If we want to use the decoy for result validation, the decoy information should never be released to the search algorithm.

If this re-scoring method must be used because of the low performance of some database search software, it should only be used on very large datasets to reduce the risk of overfitting.

Perhaps the best solution is the third one. That is, the retraining of the score parameters should be done for each different instrument type, instead of each dataset. This will gain much of the benefit provided by machine learning, but without the problem of over-fitting. Indeed, this third approach is what we do in the PEAKS DB algorithm.

*The content of this post is extracted from "Practical Guide to Significantly Improve Peptide Identification Sensitivity and Accuracy" by Dr. Bin Ma, CTO of Bioinformatics Solutions Inc. You can find the link to the guide on this page.

Monday, April 22, 2013

Common pitfalls of FDR estimation, part two

The second pitfall of the traditional target-decoy strategy is caused by another popular technique used to increase the peptide identification sensitivity.  

The idea is clever: if a weakly identified peptide happens to be on a highly confident protein, then the peptide is likely to be correct regardless of its low score. So, to increase the sensitivity, the software can add a score bonus to each peptide on a multiple-hit protein. Indeed, this protein bonus will save some weak true hits, but it will save some weak false hits at the same time. The bigger problem is that the target database will provide more multiple-hit proteins than the decoy. As a result, more weak false hits will be saved from the target database than from the decoy. This causes FDR underestimation.
The decoy fusion approach can solve this problem effectively.

Because the target and decoy sequences are concatenated into a single protein sequence, when a protein bonus is added to the multiple-hit proteins, the same bonus is added to the target and decoy hits equally. So, weak false hits are saved with approximately equal probability in the target and the decoy. This restores the balance and provides accurate FDR estimation.

By using decoy fusion as the validation method, we can safely apply the protein bonus. We gain the sensitivity without compromising the FDR estimation.

*The content of this post is extracted from "Practical Guide to Significantly Improve Peptide Identification Sensitivity and Accuracy" by Dr. Bin Ma, CTO of Bioinformatics Solutions Inc. You can find the link to the guide on this page.

Friday, April 19, 2013

Common pitfalls of FDR estimation, part one

Today’s most widely used method for FDR estimation is the target-decoy strategy. It is a well-established method in statistics and came into use in proteomics around 2007.

In this approach, a decoy database that contains the same number of proteins as the target database is searched together with the target database by the search engine to identify peptides. In the figure, blue indicates target hits and orange indicates decoy hits; squares are false hits and circles are true hits.
The decoy proteins are randomly generated, so any decoy hit is presumed to be a false hit. Since the search engine doesn’t know which sequences are from the target and which are from the decoy, when it makes a mistake, the mistake falls in the target and decoy databases with equal probability. Thus, the total number of false target hits can be approximated by the number of decoy hits in the final result, and the FDR can be estimated as the ratio of the number of decoy hits to the number of target hits.
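The ratio described above can be written as a small Python sketch. The `(score, is_decoy)` layout, the function name, and the example scores are made up for illustration; this is not how any particular search engine stores its results.

```python
def estimate_fdr(hits, score_threshold):
    """Estimate FDR as (# decoy hits) / (# target hits) above a threshold.

    `hits` is a list of (score, is_decoy) pairs. The layout is an
    assumption made for this sketch.
    """
    accepted = [is_decoy for score, is_decoy in hits if score >= score_threshold]
    n_decoy = sum(accepted)
    n_target = len(accepted) - n_decoy
    if n_target == 0:
        raise ValueError("no target hits above the threshold")
    return n_decoy / n_target


# Hypothetical search result: True marks a decoy hit.
hits = [(95, False), (90, False), (88, False), (80, True),
        (75, False), (60, True), (55, False), (40, True)]

print(estimate_fdr(hits, 70))  # 1 decoy / 4 targets -> 0.25
print(estimate_fdr(hits, 50))  # 2 decoys / 5 targets -> 0.4
```

Raising the score threshold trades sensitivity for a lower estimated FDR, which is why the two calls give different values.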

The target-decoy strategy is a powerful method for FDR estimation. However, as we will see in this series, such a powerful method must be used with caution to avoid FDR underestimation.

The first pitfall in using the target-decoy approach for FDR estimation is due to the so-called multiple-round search strategy in today’s database search software.

This multiple-round search was popularized by the X!Tandem program, published in 2004, in order to speed up the computation. The first round uses a fast but less sensitive search method to quickly identify a shortlist of proteins from the large database. Then, the second round uses a more sensitive but slower search method to identify peptides, but only from the shortlist of proteins. This effectively speeds up the search without sacrificing too much sensitivity. Indeed, X!Tandem is one of the fastest search algorithms used today.

However, as pointed out by a paper published in JPR in 2010, this multiple-round search strategy skews the target-decoy estimation of the FDR. The reason is that after the first round, there will be more target proteins than decoy proteins in the shortlist. Thus, if the second-round search makes a mistake, the mistake is more likely to fall in the target proteins. So, we end up with fewer decoy hits than actual false target hits. This causes FDR underestimation.
The 2010 JPR paper provided a fix to this problem. But a year later, in another JPR paper, Bern and Kil pointed out that the fix was wrong, and proposed a different fix that required changing the search engine’s algorithm. This shows that FDR estimation is very tricky: even the experts can sometimes get it wrong.
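To see how large the effect of a skewed shortlist can be, here is a rough Python calculation. It assumes, as a simplification, that every second-round mistake lands on a shortlist protein uniformly at random; the protein counts are hypothetical.

```python
def expected_false_hits(n_false, n_target_proteins, n_decoy_proteins):
    """Split random second-round mistakes between target and decoy proteins.

    Simplifying assumption: each false hit falls on a shortlist protein
    uniformly at random, so the split follows the shortlist composition.
    """
    total = n_target_proteins + n_decoy_proteins
    false_target = n_false * n_target_proteins / total
    decoy = n_false * n_decoy_proteins / total
    return false_target, decoy


# Single-round search: the whole database, half target and half decoy.
print(expected_false_hits(100, 1000, 1000))  # (50.0, 50.0)

# Multiple-round search: hypothetical shortlist of 900 targets and 100 decoys.
print(expected_false_hits(100, 900, 100))    # (90.0, 10.0)
```

In the second case, 10 decoy hits stand in for 90 false target hits, so the decoy count underestimates the false target hits by a factor of nine.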

In PEAKS, we use a new approach, called decoy fusion, to solve this problem.

Instead of mixing the target and decoy databases, we append a decoy sequence to each target protein.

So, after the fast search round, the protein shortlist still contains the same length of target and decoy sequences, and the false hits of the second round have an equal chance of coming from the target and decoy sequences. This restores the balance and allows accurate FDR estimation in the multiple-round search setting.
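The idea can be illustrated with a minimal Python sketch. It is not the PEAKS implementation: the decoy here is simply a shuffled copy of the target sequence (one assumed way to generate a random decoy), the protein sequences are made up, and the target and decoy halves are kept as a pair so they can be counted separately.

```python
import random


def make_decoy(sequence, rng):
    """Return a shuffled copy of `sequence` (an assumed decoy generator)."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def fuse_database(targets, seed=0):
    """Attach one decoy to each target protein: id -> (target, decoy).

    In a real fused database the two halves would be concatenated into
    a single entry; they are kept apart here only for counting.
    """
    rng = random.Random(seed)
    return {pid: (seq, make_decoy(seq, rng)) for pid, seq in targets.items()}


def shortlist_lengths(fused, shortlist):
    """Total target and decoy residues contained in a protein shortlist."""
    target_len = sum(len(fused[pid][0]) for pid in shortlist)
    decoy_len = sum(len(fused[pid][1]) for pid in shortlist)
    return target_len, decoy_len


# Hypothetical target proteins.
targets = {"P1": "MKTAYIAKQR", "P2": "GSHMLEDPVAGK", "P3": "MSTNPKPQRK"}
fused = fuse_database(targets)

# Whatever the first round keeps, target and decoy lengths stay equal.
print(shortlist_lengths(fused, ["P1", "P3"]))  # (20, 20)
```

Because each decoy travels with its target, no choice of shortlist can favor one side over the other.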

*The content of this post is extracted from "Practical Guide to Significantly Improve Peptide Identification Sensitivity and Accuracy" by Dr. Bin Ma, CTO of Bioinformatics Solutions Inc. You can find the link to the guide on this page.

Monday, April 15, 2013

Use new / non-standard amino acids in PEAKS

Some researchers are looking for unknown amino acids in their proteomic studies. This requires the software to be able to take new theoretical amino acids into account. Although PEAKS internally uses only the 20 common amino acid residues, here is a trick we can use to add new ones.

The general concept is to add this new amino acid X as a PTM of an existing amino acid Y. The existing amino acid Y should be picked so that the mass difference between X and Y is distinguishable from the mass of all PTMs in the search. In the analysis, on top of the normal PTMs you may set, set this special PTM as a variable modification. In the result, this new amino acid X will appear as a variable modification of Y.
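A quick way to check the "distinguishable" condition is to compare the mass difference against every PTM in the search. The Python sketch below is only an illustration: the helper name and the 0.01 Da tolerance are assumptions, the PTM list is a hypothetical search setup with standard monoisotopic mass shifts, and 184.1212 Da is the difference used in the Pyrrolysine example in this post.

```python
def is_distinguishable(delta_mass, ptm_masses, tolerance=0.01):
    """True if `delta_mass` differs from every PTM mass by more than `tolerance` Da.

    The tolerance is an assumed value; choose one that matches the
    mass accuracy of your instrument.
    """
    return all(abs(delta_mass - mass) > tolerance for mass in ptm_masses.values())


# Hypothetical set of PTMs already selected for the search (mass shifts in Da).
search_ptms = {
    "Oxidation (M)": 15.9949,
    "Carbamidomethylation (C)": 57.0215,
    "Phosphorylation (STY)": 79.9663,
}

print(is_distinguishable(184.1212, search_ptms))  # True
print(is_distinguishable(57.0215, search_ptms))   # False: collides with carbamidomethylation
```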

Here is an example. Let's add the amino acid Pyrrolysine to PEAKS. Open the PTM configuration by clicking the menu "Window"-->"Configuration", then select the PTM tab. Click the "New" button in the lower right corner of the dialog. I am going to configure it as a modification of Alanine. The mass difference is 184.1212 Da.
Click "OK" to complete the configuration. When performing a PEAKS search, the new PTM, Pyrrolysine, can then be selected in the "Customized" tab of the "PTM Options" dialog.
Using the same trick, we can substitute or remove an amino acid by setting it as a fixed modification in the search.

Wednesday, April 10, 2013

Why should everyone de novo?

When I was doing some house cleaning on old documents, I found a few slides created 5 years ago about why de novo sequencing should be included in all research workflows. Most of these points still hold true today, and they align very well with PEAKS development. I'd like to share them here.

“The organism I’m studying isn’t in any of the databases.”

De novo sequencing is the only way to go.

“I need more confidence in the peptides found by MS/MS ion search”

Use an orthogonal approach like sequence tag searching or hybrid searching to validate the results.

“I can only explain <10% of my data. I need to boost my search engine’s performance.”

Use two or more unique search engines. Then try de novo sequencing.

“I think I’m still missing some peptides, maybe because of PTMs”

Use de novo to find a partial sequence; matching this to a database will highlight peptides of interest.

“I am still missing some peptides even after trying all PTMs. But the data is great!”

De novo + peptide homology search like SPIDER or BLAST.

“My lab creates too much data for de novo sequencing”

Use PEAKS. It’s fast enough.

Saturday, April 6, 2013

Update: 3.4 million MS2 spectrum project, PEAKS DB completed!

Both PEAKS de novo and PEAKS DB search finished successfully on the 3.4 million MS2 dataset. The computer has two Xeon hex-core CPUs and 32GB RAM and runs the Windows 7 Pro 64-bit OS. The search completed in a little over a week.
PEAKS de novo reported over 2 million de novo peptides with an ALC score of at least 50%.

PEAKS DB reported over 1 million PSMs at 1% FDR.

Be sure to check out the protein coverage view of one of the top-scoring proteins. It can help demonstrate that a protein is pretty much fully covered.

Tuesday, April 2, 2013

A few quick issues with running PEAKS on OS X and Linux

I have been using PEAKS as a viewer on Mac OS X for a while and have noticed some minor issues. That's expected, as PEAKS on OS X is not officially supported!

No instrument raw file loading ability. This will probably never be fixed unless the instrument vendors port their libraries to OS X. So in short, on OS X PEAKS can only read text formats (mgf, mzxml, mzml, pkl, dta, etc.) and, of course, PEAKS projects.

Images on the summary view are broken. This seems to be a coding issue related to the path (file paths on Windows and OS X are different). I did find a workaround, though. On the summary view, click the "Notes" button at the top of the view, and a text editor will show up. Type in the following and click "OK".
<a href="">go</a>
A "go" link will be displayed in the "Notes" section of the summary view. Click the link, and the summary view will be correctly displayed in your default web browser.

Vertical tabs are too small and the text on them is not visible. I am not sure how to work around this, but when you mouse over the tabs, the tooltips do work.

I will keep playing with PEAKS on OS X when I have the chance. Please comment if you find other interesting issues :D