Monday, May 6, 2013

Configure FASTA database in PEAKS

Configuring FASTA databases in PEAKS is fairly easy, especially if the FASTA file has the same header format as one of the public databases (e.g. NR, Swiss-Prot, IPI). You just select the pre-defined format and the parsing rules are filled in automatically.

Many users also use PEAKS to search their in-house, customized FASTA databases. In this situation, the header format is hard to predict and varies from case to case.

In PEAKS, parsing rules are defined using regular expressions. While regular expressions are very powerful, they take quite a bit of time to master. Since we run tons of searches every week, against FASTA files with many different header formats, I created this lazy, generic parsing rule for internal use, and in most cases it has worked well enough.

Accession. The regular expression uses everything after ">" up to the first white space or pipe character ("|") as the accession. If neither is found within the first 30 characters, the first 30 characters are used as the accession.
>\([^\s|]{1,30}\)
Description. The whole line after ">" will be used as the description.
>\(.*\)
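To see what the two rules capture, here is a rough Python sketch. It assumes that `\(...\)` in the PEAKS rule marks a capture group, which Python's `re` module writes as `(...)`. The function name `parse_header` and the sample header are made up for illustration and are not part of PEAKS.

```python
import re

# Assumption: \( ... \) in the PEAKS rules is a capture group; Python uses ( ... ).
ACCESSION_RE = re.compile(r">([^\s|]{1,30})")
DESCRIPTION_RE = re.compile(r">(.*)")


def parse_header(line):
    """Return (accession, description) for one FASTA header line."""
    accession = ACCESSION_RE.match(line)
    description = DESCRIPTION_RE.match(line)
    if accession is None or description is None:
        raise ValueError(f"not a usable FASTA header: {line!r}")
    return accession.group(1), description.group(1)


# Hypothetical in-house header, invented for this example.
print(parse_header(">prot0001 hypothetical kinase, sample A"))
# -> ('prot0001', 'prot0001 hypothetical kinase, sample A')
```

Because the accession pattern also stops at "|", a header such as ">sp|P12345|TEST_HUMAN" would give just "sp" as the accession in this sketch, so the generic rule is best suited to headers that do not start with a pipe-delimited prefix.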
Here is a screenshot of the custom FASTA database configuration in PEAKS.



