Monday, December 23, 2013

Merry Christmas and Happy New Year!

2013 will come to an end in a few days. It's been a great year for us, the PEAKS team at BSI, and we want to take this opportunity to wish everyone a Merry Christmas and a Happy New Year!



Wednesday, December 18, 2013

Navigate through your LCMS data in PEAKS

PEAKS 7 is released with a re-engineered LCMS view. Users can use this integrated, interactive view to navigate and dig into the data and results easily. In this post, I am going to talk about the controls and functionality we have built into this view.

The LCMS view can be found on any opened data node, identification result, or label free quantification result. It contains four main components: the main heatmap view, the control panel, the TIC and the thumbnails.


Heatmap View

In the heatmap, color represents the strength of the LC-MS signal: the darker the color, the stronger the signal. The x-axis is the m/z and the y-axis is the retention time (RT).

Navigation control

Zooming in and out and moving the view work much like Google Maps. Scroll the mouse wheel forward to zoom in and backward to zoom out. Drag the mouse while holding the left mouse button to move the view. In some situations, you may want to zoom on only one axis. This can be achieved by scrolling the mouse wheel while the mouse cursor is on the RT axis or the m/z axis. If you know exactly where you want to look, you can also drag the mouse while holding the right mouse button to select an area and quickly zoom into it.

Adjust the contrast

Scrolling the mouse wheel while holding the CTRL key adjusts the contrast of the heatmap. This is particularly useful when looking at a less intense area where the default contrast cannot visually distinguish the signal strength. Here is an example. The heatmap is zoomed into a very small region. The image on the left uses the contrast that works better when zooming all the way out and viewing the entire heatmap. In this local region, however, the colors are all the same, and you cannot visually distinguish the strength of the signal. The image on the right displays the same region with the contrast adjusted.


Control Panel

The control panel is in the upper right corner of the heatmap view. It provides information such as the m/z and RT of the mouse cursor location, along with many other functions associated with the heatmap.

Intensity display cut-off

The slider (0) can be used to adjust the intensity cut-off. Any peak with an intensity below this cut-off will not be displayed. This is very useful when you want to focus on the real signal and get a much cleaner view. This slider only affects the heatmap display. We use a much more sophisticated algorithm for noise removal in our feature detection.


The buttons

The back and forward buttons (1) provide an easy way to restore the view to a previous state. PEAKS remembers all the operations you have done to the view, including zooming, moving and changing contrast.

The 1:1 button (2) will immediately zoom out in the view to display the full heatmap. You can also invoke this function by double clicking your left mouse button anywhere in the heatmap.

The 3D button (3) toggles the view between heatmap and 3D mode. While the heatmap mode is sufficient in most cases, 3D mode may provide a more direct visualization of the signal intensity in some situations. Below is an example displaying the same data in both modes. In 3D mode, most of the navigation controls are the same as the ones mentioned previously, with one minor difference: you cannot change the contrast. Instead, holding the CTRL key while scrolling the mouse wheel changes the height of the peaks.

 
The export button (4) is self-explanatory. Many images in this post were created this way. It supports over-sampling up to 8x, which generates a much bigger image with smooth fonts for print.

The fold button (5) toggles the control panel between the full panel mode and mini panel mode. The mini panel can save you some screen space especially on old, low resolution monitors.

The synchronize button (6) will synchronize the m/z and RT position across all samples in the result. This is very useful when you are examining a peptide feature and wondering what the same area looks like in other samples. Instead of manually navigating to the same m/z and RT region in a second sample, you only need to make sure that this button is selected and then change to the sample you want to look at. PEAKS will automatically navigate to the same area with the same zoom level and contrast.

The help button (7) displays a small dialog with quick help on the navigation controls.

Jump to an area quickly

The text box (8) can be used to jump to an m/z and RT position quickly if you know the numbers. The format is m/z, white space, RT (e.g. 420.83 26.25). PEAKS will center the view on the provided coordinates and determine an appropriate zoom level.
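To make the input format concrete, here is a minimal Python sketch of how such a "m/z, white space, RT" string could be split into two numbers. The function name `parse_jump` is my own and this is only an illustration of the format, not how PEAKS parses the text box.

```python
def parse_jump(text):
    """Parse 'm/z RT' text such as '420.83 26.25' into two floats.

    Hypothetical helper for illustration only.
    """
    parts = text.split()  # split on any run of white space
    if len(parts) != 2:
        raise ValueError("expected two numbers: m/z, white space, RT")
    mz, rt = (float(p) for p in parts)
    return mz, rt


mz, rt = parse_jump("420.83 26.25")
print(mz, rt)
```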

Display options

In the control panel, there is a group of checkboxes (9) that control whether or not to overlay certain information on the heatmap. Depending on the result type, it may include features, MS/MS spectra, identified peptides and de novo only tags.

The following example is from a PEAKS DB result with all the checkboxes selected. The pink circle represents the center of a peptide feature. The empty blue square indicates an MS/MS spectrum. The solid blue square means that the MS/MS spectrum has a confident peptide identification that passed the score filter. The solid yellow square means that although the MS/MS spectrum does not have a confident identification, it produced a de novo only tag.


Hovering the mouse over the circles and squares will display more information. Clicking on a square will pop up the spectrum alignment view, where you can examine the MS/MS spectrum directly.


In the label free quantification result, you may see two types of pink circles in the heatmap view, solid and empty. Both mean that PEAKS has detected a feature there; the solid circle indicates that the feature has passed the filters. Clicking on a pink circle will highlight the approximate boundary of that feature. You can highlight multiple features by holding the CTRL key when clicking the mouse button.


TIC

PEAKS displays the TIC beside the heatmap RT axis. You can zoom in/out on the RT axis by scrolling the mouse wheel on the TIC. When zoomed in, PEAKS will highlight the portion of the RT range that is being displayed in the heatmap view. You can drag the boundary of the highlighted portion to adjust the zoom level, or you can drag the middle of the highlighted portion to move the heatmap content along the RT axis.


Thumbnails

The thumbnails are small, non-zoomable versions of the heatmaps. Each provides an overall preview of a sample. The contrast is connected to the corresponding heatmap, so changes to the contrast of a heatmap also apply to its thumbnail. PEAKS displays a rectangle on the thumbnail to highlight the area currently viewed in the heatmap. You can adjust the area by scrolling the mouse wheel on the thumbnail or dragging the borders, which effectively zooms in/out in the heatmap. You can also drag the middle of the rectangle to quickly move the viewing area in the heatmap.


If the result relates to multiple samples, all the thumbnails will be listed here. You can quickly switch between samples by simply clicking on the thumbnail of the sample you want to look at.


Monday, December 2, 2013

Boost your analysis speed with PEAKS 7

New mass spectrometry instruments come out every year with higher resolution and/or scan rate, so the amount of data generated per hour keeps increasing. While the accuracy and sensitivity of the software tools are critical, many researchers also need to complete the analysis sooner without compromising either.

In PEAKS 7, we have re-engineered many parts of the algorithms to make them more efficient.
"The increase in speed [of PEAKS 7] is quite remarkable."
Paul Taylor, Sick Kids Hospital
In this post, we will use an internal test to illustrate the dramatic speed improvements our users love.

Test data and search parameters

Twelve files generated from a Thermo Q-Exactive instrument were used. Both MS and MS/MS were acquired at high resolution. In total, there are more than 100,000 MS/MS spectra. The detailed search parameters are as follows.


Test hardware and operating system

Mac mini late 2012 version. Intel i7-3720QM processor at 2.6GHz. 16GB RAM. Windows 7 Professional Edition 64 bit. The Windows OS is installed using Boot Camp.

The test and the result

PEAKS 6 and PEAKS 7 were installed on the same Mac mini computer. The computer was rebooted between analyses.

First off, we want to test our much loved de novo sequencing speed improvement, and here is the result. PEAKS 7 de novo is more than five times faster!
The complete analysis in PEAKS can identify database peptides, novel peptides, PTMs and mutated peptides in one go. We use the total running time in version 6 as the baseline and see how much sooner researchers can finish the same analysis.
The same search completed in just one quarter of the time it takes in the previous version, with the same level of accuracy and sensitivity.




Thursday, November 21, 2013

PEAKS 7 Released

We are happy to announce the release of PEAKS 7.
The focus of this release is de novo sequencing improvements and a new label free quantification algorithm. The software is available for download here.

Aside from many improvements in PEAKS itself, we have been collaborating with Proteome Software and the Skyline group to ensure that PEAKS results can be imported there. Please check out the New Features in PEAKS 7 page for a long list of exciting additions.

We have also put up a web page talking about protein quantification. You can read it here.

Wednesday, October 16, 2013

PEAKS 7 Beta Starts Today

The PEAKS team has been extremely busy over the past several weeks, and today we are finally able to put the PEAKS 7 beta in our invited testers' hands.


The biggest new feature in PEAKS 7 is the new label free quantification algorithm, which will replace the old algorithm in the optional quantification module. Our HUPO 2013 poster, PEAKS - A Software Tool for Shotgun Label Free Proteomics with High Sensitivity and High Accuracy, showed some preliminary results. To help our users navigate through the data and results for manual inspection, we have re-engineered the 2D heatmap and 3D view to provide a smooth, Google Maps-like experience.

PEAKS is already the best commercial de novo sequencing solution. In PEAKS 7, we make it even better by tackling two fronts: improving accuracy and providing result validation/filtration. This poster shows that when the peptides are fragmented using different fragmentation methods the accuracy of de novo sequencing can be significantly improved. PEAKS 7 also includes tools and guidelines for de novo result filtering and validation, described in this poster.

There are many more improvements in PEAKS 7. We will unveil them at release, shall we?

We may expand our tester base slightly in the next few weeks depending on the stability of the beta build. If you would like early access to PEAKS 7, please comment on this post or send an email to peaksbeta@bioinfor.com. Although we cannot guarantee early access, we will make sure that you are notified as soon as PEAKS 7 is released.

Wednesday, October 2, 2013

Local confidence score and de novo tags

PEAKS de novo sequencing not only produces accurate de novo sequences from the spectrum, it also provides a confidence score at the amino acid level.

In the de novo table, when you hover the mouse cursor over a de novo sequence, the following window will show up to display the local confidence score for each amino acid.

The sequence is color coded. Red represents a very high confidence (greater than 90%), purple represents a high confidence (80 to 90%), blue represents a medium confidence (60 to 80%), and black represents low confidence (less than 60%).
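The color scheme above amounts to a simple lookup on the local confidence score. Here is a small Python sketch of it. The function name is made up for this post, and how the exact boundary values (60, 80, 90) are assigned is my assumption, since the description only gives the ranges.

```python
def confidence_color(score):
    """Map a local confidence score (in percent) to the display color.

    Boundary handling is an assumption: exactly 90 is treated as
    'high' (purple), exactly 80 and 60 as the start of their ranges.
    """
    if score > 90:
        return "red"      # very high confidence
    if score >= 80:
        return "purple"   # high confidence
    if score >= 60:
        return "blue"     # medium confidence
    return "black"        # low confidence


for s in (95, 85, 70, 40):
    print(s, confidence_color(s))
```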

PEAKS also provides a slider to change the low confidence amino acids into mass tags. We call each remaining run of consecutive high confidence amino acids a de novo tag.


Those de novo tags may then be used in BLAST.
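To illustrate the idea of de novo tags, the sketch below collapses residues whose local confidence falls below a threshold into a mass gap and keeps runs of confident residues as tags. This is only my illustration of the concept, not the PEAKS implementation. The function name, the output layout and the example scores are made up, and the masses are standard monoisotopic residue masses.

```python
# Monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def to_tags(sequence, confidences, threshold):
    """Return a list mixing tags (str) and mass gaps (float, in Da).

    Residues scoring below `threshold` are merged into a mass gap;
    consecutive residues at or above it form a de novo tag.
    """
    result, tag, gap = [], "", 0.0
    for aa, conf in zip(sequence, confidences):
        if conf >= threshold:
            if gap:                      # close the open mass gap
                result.append(round(gap, 4))
                gap = 0.0
            tag += aa
        else:
            if tag:                      # close the open tag
                result.append(tag)
                tag = ""
            gap += RESIDUE_MASS[aa]
    if tag:
        result.append(tag)
    if gap:
        result.append(round(gap, 4))
    return result


# Made-up confidence scores for illustration.
scores = [50, 55, 95, 92, 90, 88, 70, 85, 91, 93, 40]
items = to_tags("LGEYGFQNALK", scores, threshold=80)
print([x for x in items if isinstance(x, str)])  # the tags only
```

Moving the threshold up or down plays the role of the slider: a higher threshold gives shorter but more reliable tags.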

Friday, September 6, 2013

PEAKS at HUPO 2013 - Yokohama, Japan


The PEAKS team is getting ready to attend the HUPO 12th Annual World Congress. The conference will start on September 14th. This year's theme is "The Evolution of Technology in Proteomics".


We will be exhibiting at booth #11. At our booth, you can get hands-on experience with our current version, PEAKS 6, to see how it may help in your research. Additionally, you can take a sneak peek at our upcoming PEAKS 7, which will be released in November. We will also have help from our Japanese distributor at our booth to better serve the local researchers.

Please drop by booth #11 if you are going to Yokohama for the HUPO congress. If you are busy and cannot make it there this time, you can always get the latest information about PEAKS from our website.

We are looking forward to seeing you in Japan.

Friday, August 30, 2013

CHAMPS Antibody Sequencing Workflow

A couple of months ago, we announced our CHAMPS antibody sequencing service. With the FREE "blind trials" we offered during the promotional period, we have received quite a few datasets from several users. The responses to the results we provided have been remarkable.

Here is the general workflow we use to sequence an antibody.


We require the sample to be reduced with DTT and alkylated with iodoacetamide. Glycans need to be removed and heavy/light chains must be separated. Each chain will then be digested with six enzymes: AspN, chymotrypsin, GluC, LysC, pepsin and trypsin. MS/MS spectra must be acquired using an LTQ-Orbitrap at high resolution with HCD fragmentation. In total, we require six LCMS runs per chain.

The data analysis starts off from PEAKS de novo sequencing. A list of high quality de novo sequences will be generated along with the positional confidence score for each amino acid. Then an in-house developed program will be used to assemble the de novo peptides into much longer sequences, protein contigs. In our experiments, the majority of assembled contigs had a length of 60~120 residues.

We BLAST the protein contigs against the NCBI nr database to assemble the antibody template. We select a protein hit corresponding to the constant region and select the closest protein hit corresponding to the variable region. All the contigs will then be mapped to the template to get the first draft of the antibody sequence. In principle, we trust the contigs in the variable region and the template in the constant region.

The draft sequence will be refined iteratively using PEAKS SPIDER homology search. During each iteration, we examine the insertions/deletions/mutations reported by SPIDER, residues with low peptide coverage, and residues at the protein N-terminus, and compare the sequence mass with the intact protein mass, if available.


*Some content in this post is extracted from the ASMS 2013 poster "Whole Protein de novo Sequencing from MS/MS". You can find a web version of the poster here.

Wednesday, August 21, 2013

De Novo Assisted PTM "Blind Search" - PEAKS PTM

In PEAKS 6, we introduced a new algorithm for PTM "blind search", PEAKS PTM. This algorithm can search for modified peptides with all 600+ PTMs in the Unimod database. To use this algorithm, simply select the option in the PEAKS Search dialog as shown below.


How is the "blind search" achieved? The secret is de novo sequencing. In a usual PTM search algorithm, all possible modification forms of all database peptides satisfying the enzyme digestion rules are matched against the spectra.


In PEAKS PTM, we only search for PTMs on the peptides when there is a tag match*. The algorithm also takes the PTM rarity into account to reduce the search space and false PTM assignment.

With the PEAKS PTM algorithm selected, users only need to specify a very small number of variable PTMs in de novo and PEAKS DB to speed up the search, and can rely on PEAKS PTM to find other modifications present in the sample.


*X. Han et al. PeaksPTM: Mass Spectrometry Based Identification of Peptides with Unspecified Modifications. JPR 2011, 10(7): 2930-2936.


Friday, August 9, 2013

de novo only peptides in PEAKS

One of the unique features in PEAKS is that it provides a list of de novo only peptides. What does it mean?


de novo only peptides are the de novo sequences derived from spectra that do not have a confident database match. The following figure is a simple explanation.


In PEAKS, de novo only peptides are listed in the de novo only tab of every PEAKS DB, PEAKS PTM, SPIDER and inChorus result. The actual de novo only peptides displayed in the list will be affected by the filters in the summary tab. Let's use the PEAKS DB result below as an example.


The de novo only peptides are defined in the second row of the filters. The first part, the TLC and ALC filters, is the same as in the de novo result. It tells PEAKS what should be considered a good de novo sequence. The second part, the peptide -10lgP filter, tells PEAKS what should not be considered a confident database match. After the filters are applied, PEAKS will go through all the de novo sequences that passed the TLC and ALC filters. For each such de novo sequence, PEAKS will look at the corresponding spectrum. If the spectrum does not produce any database match with a score higher than the de novo only peptide -10lgP filter, the de novo sequence of the spectrum will be considered a de novo only peptide.
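The filter logic described above can be written down in a few lines. The following Python sketch is only my reading of the description: the record layout, the function name and the example values are all made up, and whether the filter boundaries are inclusive is an assumption.

```python
def de_novo_only(spectra, tlc_min, alc_min, lgp_threshold):
    """Return the spectra whose de novo sequence counts as 'de novo only'.

    Each spectrum is a dict with hypothetical keys:
      'tlc', 'alc'  - scores of its de novo sequence
      'db_scores'   - -10lgP scores of its database matches (may be empty)
    """
    selected = []
    for s in spectra:
        good_de_novo = s["tlc"] >= tlc_min and s["alc"] >= alc_min
        # No database match may score higher than the -10lgP filter.
        no_confident_db = all(x <= lgp_threshold for x in s["db_scores"])
        if good_de_novo and no_confident_db:
            selected.append(s)
    return selected


# Made-up example data.
spectra = [
    {"id": 1, "sequence": "LGEYGFQNALK", "tlc": 9.5, "alc": 86, "db_scores": [45.2]},
    {"id": 2, "sequence": "VATVSLPR", "tlc": 7.0, "alc": 88, "db_scores": [12.1]},
    {"id": 3, "sequence": "AGKT", "tlc": 2.0, "alc": 50, "db_scores": []},
    {"id": 4, "sequence": "DLGEEHFK", "tlc": 7.2, "alc": 90, "db_scores": []},
]
hits = de_novo_only(spectra, tlc_min=3, alc_min=80, lgp_threshold=20)
print([s["id"] for s in hits])
```

Spectrum 1 is dropped because it has a confident database match, and spectrum 3 because its de novo sequence does not pass the TLC and ALC filters.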

PEAKS does not stop at just providing a list of de novo only peptides; it also tries to associate them with the proteins. In the protein coverage view below, the gray bars represent the de novo only peptides that share at least 10 consecutive AAs with the protein sequence. This is particularly useful when looking for unexpected PTMs or glycosylation sites.
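The "at least 10 consecutive AAs" criterion can be checked with a simple sliding window. The sketch below is a naive illustration under my own assumptions: the function name is made up, the sequences are example strings, and it does an exact string comparison (it does not, for instance, treat I and L as equivalent).

```python
def shared_run_position(peptide, protein, min_len=10):
    """Return the protein index where a run of `min_len` consecutive
    residues from `peptide` first matches, or None if there is none."""
    for i in range(len(peptide) - min_len + 1):
        pos = protein.find(peptide[i:i + min_len])
        if pos != -1:
            return pos
    return None


# Example strings for illustration only.
protein = "MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFKDLGEEHFKGLVLIAFSQYLQQ"
print(shared_run_position("TGKSEIAHRFKDLGAEHFK", protein))   # shares a 10-residue run
print(shared_run_position("SEIAHRAKDLGEEHFK", protein))      # longest shared run is shorter
```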



Monday, July 29, 2013

FDR on combined result from multiple search engines

There are many database search programs available for peptide identification. Each program uses a different scoring function, which gives them complementary abilities in assigning different spectra from the same MS/MS dataset. This makes combining results from multiple search engines a popular method among researchers.

The tricky part when dealing with the combined results is how to control the quality. In PEAKS 6, we introduced a very easy to use filter in the inChorus result to help resolve this issue. To use this feature, the 'Search decoy database from PEAKS' option must be selected for the individual search engines when performing an inChorus search.

In the 'inChorus' summary view, there is an 'Edit filters' button to control the FDR at the PSM level.


Click the button and a dialog with the filter details will be displayed.


The easiest way is to select the target inChorus FDR and PEAKS will automatically determine the appropriate score threshold for each individual search engine so that the combined result has the FDR the user specified. PEAKS also gives users the flexibility to select the score threshold for individual engines manually. When the filters are applied, the overall FDR of the combined result will be calculated.







Tuesday, July 16, 2013

PEAKS performs excellently on AB SCIEX TripleTOF 5600 data

AB SCIEX TripleTOF 5600 is a powerful instrument that provides high mass accuracy and high resolution in both MS and MS/MS modes. With the ability to acquire a maximum of 50 MS/MS spectra per second, the instrument makes a great component for a proteomics research platform.

The PEAKS algorithm has been optimized specifically for this type of instrument in version 6. A comparative study against Mascot and ProteinPilot was performed using the iPRG 2012 dataset. The results are shown in the following table.


At 1% FDR, PEAKS 6 was able to identify twice as many PSMs as the popularly used Mascot + Percolator combination. Even compared with AB SCIEX’s ProteinPilot software, PEAKS 6 identified 29% more PSMs.


*The content of this post is extracted from ASMS 2012 poster "Optimized Database Search Software for Peptide Identification with AB SCIEX TripleTOF 5600". You can find a copy of the poster here.

Thursday, July 4, 2013

How to use Decoy-Fusion on Mascot in PEAKS inChorus?

PEAKS supports FDR estimation on inChorus results. To make this work on Mascot results, there are a few extra steps to follow.

PEAKS uses the decoy-fusion method for FDR estimation. The first step is to create a decoy-fusion database. Go to the PEAKS database configuration dialog. Select the FASTA database you want to search against. Then click the "Export Decoy DB" button. A decoy-fusion FASTA file will be generated.
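For readers curious what a decoy-fusion FASTA file looks like conceptually, here is a rough Python sketch. The general idea of decoy-fusion is that each target protein gets a decoy sequence attached to it, instead of the decoys being added as separate entries. Everything specific in the sketch is an assumption made for illustration: I use the reversed target sequence as the decoy and keep the header unchanged, which is not necessarily what the "Export Decoy DB" button produces. Always use the file exported by PEAKS for real searches.

```python
def parse_fasta(text):
    """Return a list of [header, sequence] pairs from FASTA text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line, ""])
        elif records:
            records[-1][1] += line
    return records


def decoy_fusion(text):
    """Append a decoy (here: the reversed target, an assumption)
    to every target sequence and return the new FASTA text."""
    return "".join(f"{h}\n{s}{s[::-1]}\n" for h, s in parse_fasta(text))


print(decoy_fusion(">P1 example protein\nMKW\nVT\n"))
```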


The second step is to configure the decoy-fusion FASTA file into Mascot. This is very straightforward in Mascot 2.4 as the parsing rule of PEAKS decoy-fusion method can be automatically detected.

After the decoy-fusion database is up and running on the Mascot server, the last step is to make sure that the "Search decoy database from PEAKS" option is selected in the search dialog.





Monday, June 24, 2013

Carbamidomethyl @ C, D, E, H, K, N-term

Recently I worked on an ETD dataset generated from an Orbitrap Velos. The user mentioned that Carbamidomethyl on Cysteine and some phosphopeptides were expected, but he was not able to get any good identification results using Mascot.

My first attempt was to use the information provided by the user and run a PEAKS DB search. The result shocked me: at 1% FDR, PEAKS DB reported only 170 PSMs. What could possibly have gone wrong?

Looking at the PEAKS DB result, there were many "de novo only" peptides, which means many spectra produced confident de novo sequences but did not have a confident database hit. This could be a result of unsuspected PTMs and mutations.

So I decided to do a PEAKS PTM search on the DB result as my second attempt. PEAKS PTM reported 582 PSMs at 1% FDR. In the PTM profile section of the summary view, many PSMs were found with the PTM Carbamidomethyl on D, E, H, K and N-term. This could be an indication of excess iodoacetamide during the alkylation procedure.


Now I knew more about the PTMs in this data. In the last attempt, I added Carbamidomethyl @ DEHK,N-term and Dehydration @ DST into the variable PTM list along with Phosphorylation @ STY. A PEAKS DB search was then performed with the PEAKS PTM option enabled. This time, at 1% FDR, 736 PSMs were reported, more than four times the number from the initial search.

PEAKS PTM is a great tool for finding unsuspected PTMs and thus helps explain more spectra.

Monday, June 17, 2013

PEAKS @ ASMS 2013

It was a very successful conference! There were many exciting new instruments, interesting research and talks.

The PEAKS user meeting was very well attended. Researchers from around the globe got a sneak peek of our latest research and enjoyed the lunch buffet.

Dr. Bin Ma giving a talk at the PEAKS user meeting

Our booth was always crowded with people interested in our solutions or services. Here is a picture I captured just after the booth had been set up, before the conference started.


Many people who missed our user meeting have been asking for the slides of the talks. We will polish the slides and post them online as soon as possible.

Monday, June 3, 2013

Stop Guessing, Start Knowing - Score threshold and FDR control in PEAKS

According to an MCP guideline published in 2004, "a significant but undefined number of proteins being reported as identified in proteomics articles are likely to be false positives".


To tackle this problem and control the "quality" of the identification results, false discovery rate (FDR) has become the most accepted result validation method over the past decade.

Although many search engines have FDR estimation integrated into their package, in order to get the result under a particular FDR, e.g. 1%, researchers usually have to do several rounds of "trial and error" on the score threshold used for filtering the result. Even worse, this process may have to be repeated for every result.

PEAKS introduced a score selection tool to make this process very simple. Normally, it only takes three mouse clicks. Here is how to do it. At the top of the result summary pane, there is an FDR button.


Simply clicking that button will bring up the score selection dialog. In this dialog, you can quickly select one of the commonly used FDR values on the right with a single click of the corresponding button. Or you can hover the mouse cursor over the chart to find the desired FDR value. When the value is found, right-click and select "copy score threshold"; the score to be used for the desired FDR will be filled into the score filter automatically. Click the "Apply Filters" button and you will get the result under the desired FDR.
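To show what "finding the score threshold for a desired FDR" means, here is a small Python sketch that replaces the manual "trial and error" with one pass over the sorted PSMs. It uses the plain ratio of decoy hits to target hits as the FDR estimate, which is a simplification I chose for illustration; it is not the decoy-fusion calculation PEAKS performs, and the function name and data are made up.

```python
def score_threshold(psms, target_fdr):
    """Return the lowest score cut-off whose estimated FDR is within
    `target_fdr`, or None if no cut-off qualifies.

    psms: list of (score, is_decoy) pairs.
    FDR estimate (a simplification): decoy hits / target hits above the cut-off.
    """
    best = None
    targets = decoys = 0
    for score, is_decoy in sorted(psms, key=lambda p: p[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= target_fdr:
            best = score   # keeping everything down to this score is acceptable
    return best


# Made-up PSM list: (score, is_decoy)
psms = [(90, False), (80, False), (70, False), (60, True),
        (50, False), (40, False), (30, True)]
print(score_threshold(psms, 0.25))
```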




Monday, May 27, 2013

CHAMPS - Antibody Sequencing Service

The PEAKS team is proud to announce the launch of the Antibody Sequencing Service - CHAMPS.

Backed by the market leading de novo sequencing algorithm and the PEAKS complete analysis workflow, our scientists are able to offer a fast and professional service for obtaining primary sequences of monoclonal antibodies with modifications.

Please contact us at champs@bioinfor.com for a test drive! More information about the service is here.

Wednesday, May 22, 2013

More than a decade of PEAKS

Staring at the calendar, I cannot believe we are almost halfway into 2013. The PEAKS team is hard at work and strives to better serve the mass spec proteomics community by adding tons of new features to every release.

Ever since the first release, PEAKS has gone through many iterations and/or dramatic changes to become what it is today.


What new features will be in this year's release? Stay tuned!

Thursday, May 16, 2013

PEAKS User Meeting co-located with ASMS 2013

We will be hosting the 7th Annual PEAKS User Meeting on June 9th, 2013 at Minneapolis Convention Center, Room 103A, co-located with ASMS conference.

Please join us and register today to reserve your seat!

http://www.bioinfor.com/peaks/corp/conferences/peaks-asms-2013.html

Here is the tentative agenda.
12:30 - 1:00 Lunch
 1:00 - 1:30 Facts and Fallacies about de Novo Sequencing and Database Search
Dr. Bin Ma, CTO at BSI and Professor at the University of Waterloo
 1:30 - 1:50 Automatic Validation of de Novo Sequencing Result
Lian Yang, Research Scientist at BSI
 1:50 - 2:10 Antibody Sequencing with LC-MS/MS
Dr. Baozhen Shan, Senior Application Scientist at BSI
 2:10 - 2:30 Common Use Cases of PEAKS Studio
Dan Maloney, Application Scientist at BSI
 2:30 - 3:30 Free Discussion. Ask the onsite BSI employees for questions and
best practices about using PEAKS in your specific application.

Monday, May 13, 2013

Multiple enzymes support in PEAKS - Full Protein Coverage

PEAKS 6 introduced a new feature specifically targeting the experiments that use multiple enzyme digestions to increase protein coverage.

In the past, users had to either search each sample separately and combine all the results manually afterwards, or use the no-enzyme option to analyze all samples in one go, which may cause more false positives. Now in PEAKS, users can specify the enzyme for each sample when creating a multi-sample project. Then in de novo sequencing and PEAKS DB, users can choose 'sample enzyme' in the enzyme list as the search option. PEAKS will use the correct enzyme when analyzing each sample.
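To see why the enzyme matters per sample, here is a small in-silico digestion sketch in Python. The cleavage rules are deliberately simplified assumptions (trypsin after K/R but not before P, LysC after K, GluC after E only), the sample names and sequence are made up, and none of this reflects how PEAKS models enzymes internally. It needs Python 3.7 or later, where `re.split` can split on zero-width matches.

```python
import re

# Simplified cleavage rules, written as zero-width regex positions.
CLEAVAGE = {
    "Trypsin": r"(?<=[KR])(?!P)",   # after K or R, not before P
    "LysC": r"(?<=K)",              # after K
    "GluC": r"(?<=E)",              # after E (simplified)
}


def digest(protein, enzyme):
    """Split a protein sequence at the enzyme's cleavage positions."""
    return [p for p in re.split(CLEAVAGE[enzyme], protein) if p]


# Hypothetical multi-sample project: one enzyme per sample.
sample_enzyme = {"sample 1": "Trypsin", "sample 2": "LysC", "sample 3": "GluC"}
protein = "DTHKSEIAHRFKDLGEEHFK"
for sample, enzyme in sample_enzyme.items():
    print(sample, enzyme, digest(protein, enzyme))
```

Each enzyme produces a different set of peptides from the same protein, so a search that assumes the wrong enzyme for a sample would look for peptides that were never generated.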

According to our users' feedback, this feature is extremely useful when you want to fully characterize a single protein. The following example shows how big a difference this feature may make.

ALBU_BOVIN protein ordered from a reputable vendor was digested with trypsin, LysC and GluC. The dataset was generated on a Thermo Orbitrap instrument. Three searches were performed. The first one uses the inChorus function to launch a Mascot search (version 1.4) on the trypsin sample only. The second search uses a standard PEAKS DB search on the trypsin sample. The third search uses the complete analysis workflow, including PEAKS PTM and SPIDER, on all three samples and uses "sample enzyme" as the enzyme option. The results are all filtered to only keep the very confident PSMs at the 0.1% FDR level.

Mascot and PEAKS DB achieve 73% and 86% protein coverage, respectively, using only the trypsin sample. In the protein coverage view below, the blue bars are the PSMs that matched the protein sequence at that position.

Mascot result on the trypsin sample
PEAKS DB result on the trypsin sample

PEAKS complete analysis on all three samples reported 96% coverage on the protein. The uncovered 4% is in the protein N-terminal region, which is most likely cleaved off and not in the purchased sample [1].
PEAKS complete analysis result on all three samples
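The coverage percentages quoted here are simply the fraction of protein residues covered by at least one identified peptide. A minimal Python sketch of that calculation follows. The function name, the sequence and the peptides are made up for illustration, and the peptides are located by exact string matching.

```python
def coverage_percent(protein, peptides):
    """Percentage of protein residues covered by at least one peptide."""
    covered = [False] * len(protein)
    for pep in peptides:
        if not pep:
            continue
        start = protein.find(pep)
        while start != -1:                      # mark every occurrence
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein.find(pep, start + 1)
    return 100.0 * sum(covered) / len(protein)


protein = "DTHKSEIAHRFKDLGEEHFK"
one_enzyme = ["SEIAHR", "DLGEEHFK"]
more_enzymes = one_enzyme + ["IAHRFKDLGE"]
print(coverage_percent(protein, one_enzyme))
print(coverage_percent(protein, more_enzymes))
```

Peptides from a second enzyme can bridge regions the first enzyme leaves uncovered, which is how the multi-enzyme search raises the coverage.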

[1] Specific binding site (Asp-Thr-His-Lys) for Cu(II) ions. T. Peters Jr., F.A. Blumenstock. J. Biol. Chem., 242 (1967), p. 1574

Monday, May 6, 2013

Configure FASTA database in PEAKS

Configuring FASTA databases in PEAKS is fairly easy especially if the FASTA file has the same header format as one of the public databases (e.g. NR, Swiss-Prot, IPI). It is just a matter of selecting the pre-defined format and the parsing rules will be automatically filled in.

A large number of users also use PEAKS to search their in-house, customized FASTA databases. In this situation, the header format is very hard to predict and varies case by case.

In PEAKS, the parsing rule is defined using regular expressions. While regular expressions are very powerful, it takes people quite a bit of time to master them. Since we have tons of searches to run every week, against FASTA files with so many different header formats, I created this lazy, generic parsing rule for internal use, and in most cases it works well enough.

Accession. The regular expression tries to use everything before the first white space or "|" character as the accession. If neither is found within the first 30 characters, the first 30 characters will be used as the accession.
>\([^\s|]{1,30}\)
Description. The whole line after ">" will be used as the description.
>\(.*\)
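If you want to try these two rules outside PEAKS, here is a small Python sketch. One assumption to flag: the rules above are shown with escaped parentheses, while in Python's `re` module a capture group is written with plain parentheses, so the sketch uses `>([^\s|]{1,30})` and `>(.*)`. The function name and the example headers are made up.

```python
import re

# Same rules as above, written with plain capture groups for Python.
ACCESSION = re.compile(r">([^\s|]{1,30})")
DESCRIPTION = re.compile(r">(.*)")


def parse_header(header):
    """Return (accession, description) for one FASTA header line."""
    acc = ACCESSION.match(header)
    desc = DESCRIPTION.match(header)
    return (acc.group(1) if acc else None,
            desc.group(1) if desc else None)


print(parse_header(">MyProt_001 hypothetical protein"))
print(parse_header(">sp|P02769|ALBU_BOVIN Serum albumin"))
```

Note that because the character class also excludes "|", a header such as the second one yields only the part before the first "|" as the accession. That is fine for a lazy catch-all rule, but for such headers the pre-defined format is the better choice.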
Here is a screenshot of the custom FASTA database configuration in PEAKS.




Wednesday, May 1, 2013

100% vs 50% CPU usage, twice as fast? Not really!

Some users observed that when performing a search, the CPU usage for PEAKS would only go up to 50%. Why does PEAKS not use 100% of the CPU?

The observation is certainly valid, but the CPU usage reported by Windows Task Manager is somewhat misleading. 100% CPU usage does not mean the program is running twice as fast as under 50% CPU usage. The reason, in my opinion, is the Hyper-Threading technology that most Intel CPUs have enabled by default. While the technology can improve the performance and responsiveness of a computer in some situations, it does not help much for computation-heavy applications like PEAKS.

I did a performance test on a desktop PC with the following specification, a quad-core CPU with lots of RAM.
Intel i7 3770 3.4GHz CPU (quad core with hyperthreading)
16GB RAM
System drive SSD, data drive 7200RPM HDD
Windows 8 Pro 64bit
The dataset contains about 9000 MS/MS spectra from a Thermo Orbitrap instrument. PEAKS de novo was manually configured to run on 1, 2, 4 and 8 threads respectively. For each configuration, two searches were done, one with hyper-threading enabled and one with hyper-threading disabled. So in total there were 8 runs; a clean project was created for each run and the PC was rebooted between runs.



As you can see in the chart, with HT enabled, there is only about a 10% performance gain from running 8 threads (100% CPU usage) compared with running 4 threads (50% CPU usage). The search does run slightly faster, but the computer is not very responsive for even the simplest tasks like email, Excel, etc.
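The 50% figure itself is just arithmetic on logical processors. Here is a tiny Python sketch of it, under the simplifying assumption that each busy worker thread keeps exactly one logical processor fully occupied and that the usage is reported relative to all logical processors. The function name is made up.

```python
def reported_cpu_usage(busy_threads, physical_cores, hyper_threading=True):
    """Approximate the usage percentage shown for a CPU-bound program.

    Assumption: each busy thread saturates one logical processor, and
    hyper-threading exposes two logical processors per physical core.
    """
    logical = physical_cores * (2 if hyper_threading else 1)
    return 100.0 * min(busy_threads, logical) / logical


print(reported_cpu_usage(4, 4, hyper_threading=True))    # quad core, HT on, 4 threads
print(reported_cpu_usage(8, 4, hyper_threading=True))    # quad core, HT on, 8 threads
print(reported_cpu_usage(4, 4, hyper_threading=False))   # quad core, HT off, 4 threads
```

So on the quad-core test machine, 4 busy threads already occupy all four physical cores even though the display reads 50%.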




Friday, April 26, 2013

Common pitfalls of FDR estimation part three


The third pitfall is also caused by an over-emphasis on sensitivity.

Another trend in database search software is to re-score the peptide identification results by using machine learning. The idea is straightforward: after the search, we know what the decoy hits are. The algorithm should take advantage of this and retrain the parameters of the scoring function to get rid of the decoy hits. In doing so, it will get rid of a lot of the false target hits as well.

The method is valid, except that it may cause FDR underestimation. This is because the false target hits are unknown to the machine learning algorithm. Therefore, there is a risk that the machine learning algorithm removes more decoy hits than false target hits, so the remaining decoy hits no longer reflect the number of false target hits.

This overfitting risk is well known in machine learning. A machine learning expert can reduce the risk but can never get rid of it.

The solution to this pitfall number 3 is trickier.

The first suggestion: don’t use it. The philosophy here is that judges cannot be players. If we want to use the decoy for result validation, the decoy information should never be released to the search algorithm.

If this re-scoring method must be used due to the low performance of some database search software, it should only be used for very large datasets to reduce the risk of overfitting.

Perhaps the best solution is the third one. That is, the retraining of the score parameters should be done for each instrument type, instead of each dataset. This gains much of the benefit provided by machine learning, but without the problem of overfitting. Indeed, this third approach is what we do in the PEAKS DB algorithm.


*The content of this post is extracted from "Practical Guide to Significantly Improve Peptide Identification Sensitivity and Accuracy" by Dr. Bin Ma, CTO of Bioinformatics Solutions Inc. You can find the link to the guide on this page.

Monday, April 22, 2013

Common pitfalls of FDR estimation part two


The second pitfall of the traditional target-decoy strategy is caused by another popular technique used to increase the peptide identification sensitivity.  

The idea is clever: if a weakly identified peptide happens to be on a highly confident protein, then the peptide is likely to be correct regardless of its low score. So, to increase the sensitivity, the software can add a score bonus to each peptide on a multiple-hit protein. Indeed, this protein bonus will save some weak true hits, but it will save some weak false hits at the same time. The bigger problem is that the target database will provide more multiple-hit proteins than the decoy. As a result, more weak false hits will be saved from the target database than from the decoy. This causes FDR underestimation.


In PEAKS, the decoy fusion approach solves this problem effectively.

Because the target and decoy sequences are concatenated into a single protein sequence, when a protein bonus is added to the multiple-hit proteins, the same bonus will be added to the target and decoy hits equally. So, weak false hits are saved with approximately equal probabilities in the target and decoy. This recreates the balance and provides accurate FDR estimation.

By using decoy fusion as the validation method, we can safely apply the protein bonus. We gain the sensitivity without compromising the FDR estimation.

 
*The content of this post is extracted from "Practical Guide to Significantly Improve Peptide Identification Sensitivity and Accuracy" by Dr. Bin Ma, CTO of Bioinformatics Solutions Inc. You can find the link to the guide on this page.

Friday, April 19, 2013

Common pitfalls of FDR estimation part one

Today’s most widely used method for FDR estimation is the target-decoy strategy. This is a well-established method in statistics and has been used in proteomics since around 2007.

In this approach, a decoy database containing the same number of proteins as the target database is searched together with the target database by the search engine to identify peptides. As illustrated in the figure, blue indicates target hits and orange indicates decoy hits; squares are false hits and circles are true hits.


The decoy proteins are randomly generated, so any decoy hit is presumed to be a false hit. Since the search engine doesn’t know which sequences are from the target and which are from the decoy, when it makes a mistake, the mistake falls in the target and decoy databases with equal probability. Thus, the total number of false target hits can be approximated by the number of decoy hits in the final result, and the FDR can be estimated as the ratio of the number of decoy hits to the number of target hits.
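The estimate described above can be sketched in a few lines of Python. This is only an illustration of the ratio, not the PEAKS implementation; the function name, the (score, is_decoy) representation and the sample scores are all hypothetical.

```python
def estimate_fdr(hits, score_threshold):
    """Estimate FDR as (#decoy hits) / (#target hits) at or above a threshold.

    `hits` is an iterable of (score, is_decoy) pairs.
    """
    targets = 0
    decoys = 0
    for score, is_decoy in hits:
        if score < score_threshold:
            continue
        if is_decoy:
            decoys += 1
        else:
            targets += 1
    if targets == 0:
        return 0.0  # no target hits accepted, so there is nothing to estimate
    return decoys / targets


# Hypothetical search result, for illustration only.
example_hits = [(95, False), (90, False), (88, False), (80, True),
                (75, False), (60, True), (55, False)]
print(estimate_fdr(example_hits, score_threshold=70))  # 0.25
```

With the threshold at 70, four target hits and one decoy hit are accepted, so the estimated FDR is 1/4.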

The target-decoy strategy is a powerful method for FDR estimation. However, as we will discover in the next little while, such a powerful method must be used with caution to avoid FDR underestimation. 

The first pitfall in the use of the target-decoy approach for FDR estimation is due to the so-called multiple-round search strategy in today’s database search software.

This multi-round search was popularized by the X!Tandem program, published in 2004, in order to speed up the computation. The first round uses a fast but less sensitive search method to quickly identify a shortlist of proteins from the large database. Then, the second round uses a more sensitive but slower search method to identify peptides, but only from the shortlist of proteins. This effectively speeds up the search without sacrificing too much sensitivity. Indeed, X!Tandem is one of the fastest search algorithms used today.

However, as pointed out by a paper published in JPR in 2010, this multiple-round search strategy screws up the target-decoy estimation of the FDR. The reason is that after the first round, there will be more target proteins than decoy proteins in the shortlist. Thus, if the second-round search makes a mistake, the mistake is more likely to fall on a target protein. So, we will end up with fewer decoy hits than actual false target hits. This causes FDR underestimation.
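A small worked example makes the imbalance concrete. The Python sketch below assumes that a second-round false hit lands on any shortlisted protein with equal probability; the shortlist sizes and the number of false hits are hypothetical values chosen only for illustration.

```python
def expected_second_round_errors(target_in_shortlist, decoy_in_shortlist, false_hits):
    """Split second-round false hits between target and decoy proteins.

    Assumption: a false hit lands on any shortlisted protein with equal
    probability (i.e. proteins of similar length).
    """
    shortlist_size = target_in_shortlist + decoy_in_shortlist
    decoy_false = false_hits * decoy_in_shortlist / shortlist_size
    target_false = false_hits - decoy_false
    return target_false, decoy_false


# Hypothetical shortlist: 900 target proteins and 100 decoy proteins.
target_false, decoy_false = expected_second_round_errors(900, 100, false_hits=200)
print(target_false, decoy_false)  # 180.0 20.0
```

In this made-up case, the 20 decoy hits would be taken as the estimate of false target hits, while the expected number of false target hits is 180.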


The JPR paper in 2010 provided a fix to this problem. But a year later, in another JPR paper, Bern and Kil pointed out that the fix was wrong, and proposed a different fix that required changing the search engine’s algorithm. This shows that FDR estimation is very tricky; even the experts can sometimes get it wrong.

In PEAKS, we use a new approach, called decoy fusion, to solve this problem.

Instead of mixing the target and decoy databases, we append a decoy sequence to each target protein.

So, after the fast search round, the protein shortlist will still contain the same length of target and decoy sequences, and the false hits of the second round will have an equal chance of coming from the target and decoy sequences. This recreates the balance and allows accurate FDR estimation in the multiple-round search setting.
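The idea of appending a decoy to each target protein can be sketched as follows. This is only a minimal Python illustration, not the PEAKS implementation: the post says decoys are randomly generated but does not say how, so shuffling the target's residues is an assumption made here, and the function names and the sample sequence are hypothetical.

```python
import random


def make_decoy(sequence, rng):
    """Shuffle the residues of a target sequence (one possible decoy scheme)."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def fuse_database(targets, seed=0):
    """Return {protein_id: target_sequence + decoy_sequence}."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return {pid: seq + make_decoy(seq, rng) for pid, seq in targets.items()}


# Hypothetical one-protein database, for illustration only.
fused = fuse_database({"PROT001": "MKWVTFISLLLLFSSAYSRG"})
for pid, seq in fused.items():
    half = len(seq) // 2
    print(pid, "target:", seq[:half], "decoy:", seq[half:])
```

Because every fused entry carries equal lengths of target and decoy sequence, any protein that survives the first round brings its decoy half along with it.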


*The content of this post is extracted from "Practical Guide to Significantly Improve Peptide Identification Sensitivity and Accuracy" by Dr. Bin Ma, CTO of Bioinformatics Solutions Inc. You can find the link to the guide on this page.