The third pitfall is also caused by an over-emphasis on sensitivity.
There is another trend in database search software: re-scoring the peptide identification results with machine learning. The idea is straightforward: after the search, we know which hits are decoys. The algorithm should take advantage of this and retrain the parameters of the scoring function to get rid of the decoy hits. In doing so, it will get rid of a lot of the false target hits as well.
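To make the idea concrete, here is a minimal sketch of such decoy-aware re-scoring. It is not the method of any particular search engine; it assumes, purely for illustration, that every peptide-spectrum match (PSM) comes with a small feature vector and a target/decoy label, and it uses a generic logistic-regression re-scorer on made-up data.

```python
# Minimal sketch of decoy-aware re-scoring (hypothetical features and data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical PSM features, e.g. (search score, mass error, matched-ion fraction).
features = rng.normal(size=(10_000, 3))
# True = target hit, False = decoy hit, as known after the search.
is_target = rng.integers(0, 2, size=10_000).astype(bool)

# The re-scorer is trained to separate targets from decoys -- this is exactly
# where the decoy information "leaks" into the scoring function.
model = LogisticRegression().fit(features, is_target)

# Re-scored PSMs: higher values mean "more target-like".
new_scores = model.decision_function(features)
```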
The method is valid, except that it may cause FDR underestimation. This is because the false target hits are unknown to the machine learning algorithm, whereas the decoy hits are. There is therefore a risk that the algorithm removes more decoy hits than false target hits, which biases the decoy-based FDR estimate downward.
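A small worked example shows why. The standard target-decoy estimate is FDR ≈ (number of decoy hits) / (number of target hits) above the score threshold; the counts below are made up for illustration only.

```python
def estimated_fdr(n_decoy_hits, n_target_hits):
    """Standard target-decoy FDR estimate above a score threshold."""
    return n_decoy_hits / n_target_hits

# Before re-scoring: 1000 accepted target hits (50 of them false) and 50 decoy hits.
print(estimated_fdr(50, 1000))   # 0.05 -- the estimate matches the true FDR

# After over-fitted re-scoring: the model pushes 40 decoys (which it saw during
# training) below the threshold, but only 10 false targets (which it never saw).
# Remaining: 10 decoys, 990 accepted targets, 40 of them still false.
print(estimated_fdr(10, 990))    # ~0.010 estimated, while the true FDR is 40/990 ~ 0.040
```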
This overfitting risk is well known in machine learning. A machine learning expert can reduce the risk but can never eliminate it.
The solution to this third pitfall is trickier.
The first suggestion: don’t use it. The philosophy here is that judges cannot also be players. If we want to use the decoys for result validation, the decoy information should never be released to the search algorithm.
If this re-scoring method must be used because of the low performance of some database search software, it should only be used for very large datasets to reduce the risk of overfitting.
Perhaps the best solution is the third one: retrain the scoring parameters for each instrument type, instead of for each dataset. This retains much of the benefit provided by machine learning without the problem of overfitting. Indeed, this third approach is what we do in the PEAKS DB algorithm.
*The content of this post is extracted from "Practical Guide to Significantly Improve Peptide Identification Sensitivity and Accuracy" by Dr. Bin Ma, CTO of Bioinformatics Solutions Inc. You can find the link to the guide on this page.