Help

Amino Acid Colors
Basic
Acidic
Polar
Nonpolar
Background probability distribution
The relative entropy (also known as the Kullback-Leibler divergence) is a measure of the difference between two probability distributions P and Q. If, say, the distribution Q specifies the frequency of occurrence of particular amino acids in a given column of a multiple sequence alignment, the distribution P represents the background frequency in a random sample. The "Background frequencies" option lets you specify a custom background probability distribution.

The uploaded file should contain only the one-letter amino acid codes, newlines and comment lines starting with ">" such as

ALYEDPPDHKTSPSGKPVLYEDPPDQKTSPSGKS
VLYEDPPDQKTSASGKS
or
>APE_H._sapiens         
ALYEDPPDHKTSPSGKP
>APE_M._musculus        
VLYEDPPDQKTSPSGKS
>APE_R._norvegicus      
VLYEDPPDQKTSASGKS

If unchecked, the default background frequencies will be used.
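
As a rough illustration of what such a file encodes, the sketch below counts residues in a text of the format described above and turns the counts into relative frequencies. The function name `background_frequencies` is hypothetical, and the sketch only shows plain amino acid frequencies; how PCPMer converts them into distributions over descriptor bins is not covered here.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def background_frequencies(text):
    """Return the relative frequency of each standard amino acid in `text`."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and comment lines
        # count only the 20 standard one-letter codes
        counts.update(ch for ch in line.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no amino acid residues found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}

example = """>APE_H._sapiens
ALYEDPPDHKTSPSGKP
>APE_M._musculus
VLYEDPPDQKTSPSGKS
"""
freqs = background_frequencies(example)
```

Here `freqs["P"]` is 7/34, because proline occurs seven times among the 34 residues of the two example sequences.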

Blast Expectation Value
Expected number of chance matches in a random model. This setting specifies the statistical significance threshold for reporting matches against database sequences. The default value (1.0) means that 1 such match is expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990). If the statistical significance ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported.

Setting the value to 0.001 will produce a more homogeneous set of results; raising it to 10.0 will produce a more heterogeneous group of sequences.

Create New Motif
By default, MotifSearch looks in a database of proteins for motifs found by MotifMaker in a multiple sequence alignment. It can also create a motif from an arbitrary substring of the reference sequence of a multiple sequence alignment (see the Motif from a Profile Substring option) or even Create New Motif out of thin air. Extensive use of the last option is not recommended, because motifs are more than just sequences of residues. Each motif entering MotifSearch calculations is accompanied by information about relative entropies, standard deviations and average quantitative descriptor values. Moreover, the Bayesian method for scoring sequences also requires the original multiple sequence alignment as input. With the Create New Motif option, all this information is missing.

Multiple motifs (strings) can be given, separated by spaces. The strings can contain dots, which denote insignificant residues.
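
The input format can be pictured with a small sketch that splits the text on whitespace and records which positions of each motif carry a residue rather than a dot. The function `parse_motifs` and the layout of its result are assumptions made for illustration, not PCPMer's internal representation.

```python
def parse_motifs(text):
    """Split a space-separated motif list and locate the non-dot positions."""
    motifs = []
    for motif in text.split():
        # (index, residue) pairs for every position that is not a dot
        significant = [(i, aa) for i, aa in enumerate(motif) if aa != "."]
        motifs.append({
            "motif": motif,
            "length": len(motif),
            "significant": significant,
        })
    return motifs

parsed = parse_motifs("DRGWGNGC..FGKG ALYED")
```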

Cut-offs
The values are always from the interval 0-1, but the specific values differ for different settings. For example, the total entropy can be calculated from the five relative entropies in different ways. Also the background probability distributions may differ.
Significant Residues Motifs / Maximum Value Approach
In the maximum-value approach, the total entropy of 100% conserved Leu or Val is 0.394 and the total entropy of 100% conserved Cys is 1.000.
Significant Residues Motifs / Unscaled Sum Approach
In the unscaled-sum approach, the total entropy of 100% conserved Leu is 0.497 and the total entropy of 100% conserved Cys is 1.000.
Significant Residues Motifs / Scaled Sum Approach
In the scaled sum approach, the total entropy of 100% conserved Leu is 0.545 and the total entropy of 100% conserved Tyr is 1.000.
Similar Residues Motifs
The similarity value is 1.000 for absolutely conserved positions and 0.000 for very variable positions. The scale is not linear; reasonably conserved positions have values greater than or equal to 0.500.
Gap length
The maximum length of gaps (non-significant residues) allowed in a motif.
Include Gaps
Motif Maker can include positions containing gaps in motifs. This makes sense, because the gap may be present only in the reference sequence.
Input Sequences
There are several ways to feed sequences into PCPMer. If you already have a multiple sequence alignment, go directly to MotifMaker or the 3D Variability tool. If you don't, use one of the following options:
  • Copy and paste your sequence of interest in the form and use the run BLAST search option to find sequences related to your sequence.
  • Put your sequences into a ClustalW input file (i.e. Fasta format) and use the upload file option.
  • Use the download file option to automagically connect your web service with PCPMer. (See, for example, the "PCPMer analysis" type of display in the Flavitrack database.)
    http://landau.utmb.edu:8080/pcpmer/Tools/alignments.jsp?alignmentDownload=LINK_TO_THE_ALIGNMENT
ClustalW is run with default parameters and only files of a limited size are permitted. If you need more control over your multiple sequence alignment, run the program locally. Refer to the ClustalW documentation for supported formats.
Minimum size
The minimum number of conserved residues in motifs.
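
How the Gap length and Minimum size parameters could interact is sketched below. This is a guess at the general idea, not MotifMaker's actual algorithm: it assumes a motif is a stretch of columns that starts and ends with a significant column, contains no run of non-significant columns longer than `max_gap`, and has at least `min_size` significant columns. The function name `find_motifs` is hypothetical.

```python
def find_motifs(significant, max_gap, min_size):
    """Return inclusive (start, end) column index pairs of candidate motifs.

    `significant` is a sequence of booleans, one per alignment column.
    """
    motifs = []
    start = last = None  # first and most recent significant column of the open motif
    count = 0            # number of significant columns in the open motif
    for i, flag in enumerate(significant):
        if not flag:
            continue
        # close the open motif if the gap since the last significant column is too long
        if start is not None and i - last - 1 > max_gap:
            if count >= min_size:
                motifs.append((start, last))
            start, count = None, 0
        if start is None:
            start = i
        last = i
        count += 1
    if start is not None and count >= min_size:
        motifs.append((start, last))
    return motifs

flags = [c == "1" for c in "0110100001110"]
motifs = find_motifs(flags, max_gap=2, min_size=3)
```

With these flags the sketch yields two candidate motifs, columns 1 to 4 and columns 9 to 11 (indexed from 0); raising `min_size` to 4 rejects both.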
Motif Types
Significant Residues Motifs
PCPMer traditionally recognised significant columns of a multiple sequence alignment by means of relative entropy, also known as the Kullback-Leibler divergence. This approach will identify positions with small probabilities of spontaneous occurrence, even when the residues are not conserved across the column of the multiple sequence alignment. For example, a column containing the Glu and Pro residues in the ratio 1:1 will be identified as highly unlikely and therefore significant, although the two amino acids are very different. As another example, a column rich in Leu and Ile residues has a small relative entropy, although the physicochemical properties are well conserved. This is because these residues are very abundant.
The Kullback-Leibler divergence is defined as
\sum_{i=1}^{n} Q_p(i) \log_2\frac{Q_p(i)}{P_p(i)}
where Q_p are the discrete probability distributions of the five property values observed in the alignment and P_p are the probability distributions of a random sample. In this equation, the index p iterates over the five properties E1 to E5 and the index i over the bins of the discrete probability distribution (the 20 amino acids were grouped into 5 bins for each vector, so n=5). Thus Q_p(i) is the fraction of the component p observed in bin i and P_p(i) is the corresponding background frequency.
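
The formula above, for a single property p, can be written as a short function. This is a minimal sketch: the name `relative_entropy` is hypothetical, terms with an observed fraction of zero are taken to contribute nothing (the usual convention), and no scaling to the 0-1 interval is applied.

```python
import math

def relative_entropy(q, p):
    """Kullback-Leibler divergence (base 2) of observed q from background p."""
    if len(q) != len(p):
        raise ValueError("q and p must have the same number of bins")
    total = 0.0
    for q_i, p_i in zip(q, p):
        if q_i == 0:
            continue  # a bin with no observations contributes nothing
        if p_i == 0:
            raise ValueError("background frequency is zero for an observed bin")
        total += q_i * math.log2(q_i / p_i)
    return total

uniform = [0.2] * 5                       # hypothetical background over n=5 bins
conserved = [1.0, 0.0, 0.0, 0.0, 0.0]     # every residue falls into one bin
value = relative_entropy(conserved, uniform)
```

For a fully conserved column against a uniform five-bin background, the result is log2(5); a column whose distribution equals the background gives 0.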
Similar Residues Motifs
An alternative definition of a motif is based on the overall physicochemical similarity of amino acids across a column of a multiple sequence alignment. The physicochemical distance between two amino acids i and j is defined as the Euclidean distance in the five-dimensional space of the property descriptors E1-E5 (the square-root term in Equation 1 below). The average physicochemical distance D of a whole column of a multiple alignment is then defined as the average of all pairs' distances (Equation 1).
In order to display the physicochemical distances using a color scale, a fixed range of values is required. PCPMer therefore displays the similarity values instead of physicochemical distances, as defined in Equation 2. The similarity is equal to 1 for absolutely conserved columns and quickly falls to 0 for more diverse columns. The similarity also contains a term which lowers its value when gaps are present in the column of the alignment.
D = \frac{2}{n (n+1)} \sum_{i<j}\sqrt{\sum_{p=E1}^{E5} (V_i^p - V_j^p)^2} \qquad (1)
S = e^{-0.1 D} \frac{N_{\text{no gaps}}}{N} \qquad (2)
where V_i^p are the five quantitative descriptor values of the i-th amino acid; N is the number of sequences in the multiple alignment; N_{\text{no gaps}} is the number of sequences not containing a gap in the given column; D is the physicochemical distance; and S is the similarity.
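
Equations 1 and 2 can be sketched in a few lines. Several things here are assumptions: n is taken to be the number of non-gap residues in the column, the descriptor vectors in `TOY_DESCRIPTORS` are made-up numbers rather than the published E1-E5 values, and the function names are hypothetical. The prefactor 2/(n(n+1)) is used exactly as written in Equation 1.

```python
import math
from itertools import combinations

def column_distance(vectors):
    """Physicochemical distance D of Equation 1 for a list of 5-D vectors."""
    n = len(vectors)
    if n == 0:
        return 0.0
    # sum of Euclidean distances over all pairs i < j
    total = sum(math.dist(a, b) for a, b in combinations(vectors, 2))
    return 2.0 * total / (n * (n + 1))

def column_similarity(column, descriptors, gap="-"):
    """Similarity S of Equation 2 for one alignment column (a string)."""
    residues = [aa for aa in column if aa != gap]
    d = column_distance([descriptors[aa] for aa in residues])
    # the fraction of non-gap sequences lowers the similarity
    return math.exp(-0.1 * d) * len(residues) / len(column)

# Made-up vectors for illustration only; NOT the real descriptor values.
TOY_DESCRIPTORS = {
    "L": (0.0, 0.0, 0.0, 0.0, 0.0),
    "E": (3.0, 4.0, 0.0, 0.0, 0.0),
}

s_conserved = column_similarity("LLLL", TOY_DESCRIPTORS)
s_gapped = column_similarity("LL--", TOY_DESCRIPTORS)
```

An absolutely conserved column without gaps has D = 0 and therefore S = 1; the same column with half of its sequences gapped has S = 0.5.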
Motif from a Profile Substring
By default, MotifSearch looks in a database of proteins for motifs found by MotifMaker in a multiple sequence alignment. The Motif from a Profile Substring dialog lets you enter any substring of the reference sequence of the alignment. Because arbitrary substrings will in general have small relative entropies, we advise decreasing the value of the Significance Threshold parameter.

Multiple substrings can be given, separated by spaces. The strings can contain dots, which denote insignificant residues. Each substring must be unique within the reference sequence.

Note: This dialog is not the same as the Create New Motif dialog.

MotifMaker Result Page
Move the mouse cursor over the list to highlight the features on the screenshot image:
  • The shaded areas in the first row indicate recognized motifs.
  • The columns with an asterisk (*) indicate absolutely conserved columns.
  • When the mouse cursor is positioned over the one-letter codes of the reference sequence, the index of the residue is displayed. The first number is relative to the reference sequence and the second number is the column of the multiple alignment.
  • The red fields in the "similarity" row correspond to variable columns, green fields correspond to conserved columns. When the mouse cursor is positioned over the fields, the calculated value of similarity will be displayed (a number from the interval 0-1). The similarity of absolutely conserved columns is 1.
  • The gray fields in the rows D1-D5 show the average physicochemical distances for each of the five properties. The dark colors correspond to low distances and therefore to more conserved columns. When the mouse cursor is positioned over the fields, the calculated value of the physicochemical distance will be displayed. The physicochemical distance of absolutely conserved columns is 0.
  • The red fields in the "total E" row correspond to variable columns with low relative entropies, blue fields correspond to columns with high relative entropies. The actual value of relative entropy is displayed when the mouse cursor is positioned over the fields.
  • The gray fields E1-E5 show the relative entropies for each of the five properties. The dark colors correspond to high entropies, light colors to low entropies.
  • The bottom part of the table shows which motifs would be recognized with different cutoff values.
MotifMaker result page screenshot
MotifSearch Database
Motif Search can operate in two ways:
  1. Search the Astral database and find sequences which contain selected motifs. (Are you interested in other than Astral databases? Let us know!)
  2. The user can enter a protein sequence and Motif Search will determine how well the motifs match the sequence at each position of the sequence. Multiple sequences in FASTA format may be given.
MotifSearch Result Page
Move the mouse cursor over the list to highlight the features on the screenshot image:
  • The searched motifs section shows the characteristics of the searched motifs. The fields marked with x indicate which of the five quantitative descriptors exceeded the significance threshold. The unmarked fields were not used when scoring the motif against a sequence.
MotifSearch result page screenshot
  • In the "search within a sequence" mode, the motifs are aligned with every position of the sequence and scores are calculated. The score for each position is displayed on a color scale. White indicates low scores, blue represents high scores. In the example below, the blue D indicates that the motif DRGWGNGC..FGKG gives an almost perfect match with the sequence at that position. When the mouse cursor is positioned over the fields, the actual value of the score and the position are displayed. Below, the top five positions are shown in detail and aligned with the motif.
MotifSearch result page screenshot
  • In the "search in a database" mode, the motifs are also aligned with every position of each sequence from the database and scores are calculated. However, only the best matching positions are displayed and the total score is calculated. Best scoring sequences are listed first. Note that the total score is used only to determine the relative order of the best matches; its actual value is meaningless and cannot be compared with values obtained in different searches.
    The calculation of the total score is based on the score values (Seq) for each motif and on the average scores calculated for each sequence of the multiple alignment (Aln) and for each sequence of the database (DB). If the score filter is turned on, these values decide whether a motif contributes to the total score. With a fixed cutoff set, all motifs with a score value (Seq) smaller than the cutoff are ignored. With a mean cutoff set, all motifs with a score value (Seq) smaller than (Aln + DB)/2 are ignored.

    The motifs colored red have been evaluated as random, non-significant matches, because their score is within two standard deviations of the average score (Seq < DB avg + 2 × DB dev). That is, at this level of accuracy, a great many motifs can be found. The motifs colored black deviate somewhat from the standard distribution in the database and the motifs colored green deviate significantly. Note that the color code is for basic orientation only. It is up to the user to decide which hits can be considered biologically meaningful.

    When the mouse cursor is positioned over the fields, the position of the amino acid in the reference sequence will be displayed. When the icon in the "Description" column is clicked, the sequence is displayed without aligned motifs, a feature useful for copying the sequence to the clipboard and pasting it elsewhere.
MotifSearch result page screenshot
MotifSearch Score Filter
Quick explanation:
Choose one of the Fixed Cutoff or Exclude non-matching options to filter non-significant motifs out of the results, individually for each sequence.

Detailed explanation:
MotifSearch walks through a protein database and for each of the sequences calculates a total score for the given set of motifs. The total score for a given sequence is based on "similarity" scores calculated for each of the motifs. The similarity score is a value between 0 and 1, where the extreme value of 1 indicates that the sequence contains a perfect match for the motif.

The Score Filter lets you drop non-significant matches from the total score evaluation (and from the display). The Fixed Cutoff option filters out those contributions to the total score which are smaller than the given threshold. The Exclude non-matching option does the same, but sets the cutoff dynamically, for each motif separately. The dynamic thresholds are set to two standard deviations above the average score value, calculated for the whole protein database. In other words, although a motif may have a quite low score for the given sequence, it may still be identified as important as long as it gives a higher score for the given sequence than for the other sequences in the database.
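
The two filtering options can be sketched as follows. The function names and all numbers are hypothetical; the sketch only assumes that each motif has a score for the current sequence and that the database-wide average and standard deviation of that motif's score are known.

```python
def fixed_cutoff_filter(scores, cutoff):
    """Keep only motifs whose score reaches a single fixed threshold."""
    return {motif: s for motif, s in scores.items() if s >= cutoff}

def dynamic_cutoff_filter(scores, db_stats):
    """Keep motifs scoring at least two standard deviations above the
    database average, with the threshold computed separately per motif."""
    kept = {}
    for motif, s in scores.items():
        avg, dev = db_stats[motif]
        if s >= avg + 2 * dev:
            kept[motif] = s
    return kept

# Hypothetical scores of two motifs against one sequence.
scores = {"DRGWGNGC..FGKG": 0.95, "ALYED": 0.45}
# Hypothetical (average, standard deviation) over the whole database.
db_stats = {"DRGWGNGC..FGKG": (0.40, 0.10), "ALYED": (0.30, 0.05)}

kept_fixed = fixed_cutoff_filter(scores, cutoff=0.5)
kept_dynamic = dynamic_cutoff_filter(scores, db_stats)
```

In this made-up example the second motif is dropped by a fixed cutoff of 0.5 but survives the dynamic filter, because its score of 0.45 is still well above what the database typically gives for that motif.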

MotifSearch Significance Threshold
Roughly speaking, a motif is a sequence of residues with some physicochemical property conserved. Each of the physicochemical properties can be conserved to a different degree; for example, the polarity of an amino acid may be important, while the size of the residue may be variable. For a given residue, MotifSearch considers as significant only those properties whose relative entropy is larger than the Significance Threshold.
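
The selection rule amounts to a per-property comparison, sketched below. The function name is hypothetical and the threshold of 0.3 is an arbitrary illustration; the five entropies are taken from the line for residue C in the PCPgprf example on this page.

```python
def significant_properties(entropies, threshold):
    """Flag each of the five properties whose relative entropy exceeds the threshold."""
    return [e > threshold for e in entropies]

entropies = (0.6186, 0.1592, 0.3585, 0.8598, 0.3071)  # E1..E5 for one residue
flags = significant_properties(entropies, threshold=0.3)
marks = "".join("x" if f else "." for f in flags)
```

With this threshold, E2 is the only property left out of the scoring for that residue.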
MotifSearch input data
MotifSearch looks for sequences which contain motifs previously identified by MotifMaker in a multiple alignment. Note that a motif is much more than a sequence of residues. Each motif entering MotifSearch calculations is accompanied by information about relative entropies, standard deviations and average quantitative descriptor values. Moreover, the Bayesian method for scoring sequences also requires the original multiple sequence alignment as input. For these reasons, the preferred way is to search for motifs created by MotifMaker, rather than use the Create New Motif option.
Multiple Alignment
The program accepts various formats, see the examples below.

Standard ClustalW output:

Clustal 2.0

APE_H._sapiens         ALYEDPPDHKT
APE_M._musculus        VLYEDPPDQKT
APE_R._norvegicus      VLYEDPPDQKT

APE_H._sapiens         SPSGKP-----
APE_M._musculus        SPSGKS-----
APE_R._norvegicus      SASGKS-----
or each sequence on one line:
APE_H._sapiens         ALYEDPPDHKTSPSGKP-----
APE_M._musculus        VLYEDPPDQKTSPSGKS-----
APE_R._norvegicus      VLYEDPPDQKTSASGKS-----
or PIR format:
>APE_H._sapiens         
ALYEDPPDHKTSPSGKP-----
>APE_M._musculus        
VLYEDPPDQKTSPSGKS-----
>APE_R._norvegicus      
VLYEDPPDQKTSASGKS-----
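
A reader for the three layouts shown above might look like the sketch below. It is not PCPMer's parser: the function name `parse_alignment` is hypothetical, and it assumes that sequence names contain no whitespace and that lines starting with whitespace (such as Clustal conservation lines) can be skipped.

```python
def parse_alignment(text):
    """Parse Clustal-style blocks, one-line-per-sequence, or '>'-headed records
    into a dict mapping sequence name to aligned sequence."""
    sequences = {}
    current = None  # name from the most recent '>' header, if any
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line.strip() or line.upper().startswith("CLUSTAL"):
            continue  # blank line or Clustal header
        if line[0].isspace():
            continue  # conservation line under a Clustal block
        if line.startswith(">"):
            current = line[1:].strip()
            sequences.setdefault(current, "")
            continue
        parts = line.split()
        if len(parts) >= 2:
            # "name  segment": append the segment to that sequence
            sequences[parts[0]] = sequences.get(parts[0], "") + parts[1]
        elif current is not None:
            sequences[current] += parts[0]
    if len({len(s) for s in sequences.values()}) > 1:
        raise ValueError("aligned sequences differ in length")
    return sequences

clustal = """Clustal 2.0

APE_H._sapiens         ALYEDPPDHKT
APE_M._musculus        VLYEDPPDQKT

APE_H._sapiens         SPSGKP-----
APE_M._musculus        SPSGKS-----
"""
headed = """>APE_H._sapiens
ALYEDPPDHKTSPSGKP-----
>APE_M._musculus
VLYEDPPDQKTSPSGKS-----
"""
from_clustal = parse_alignment(clustal)
from_headed = parse_alignment(headed)
```

Both inputs produce the same dictionary, with the sequences in their original order.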
Number of intervals
The relative entropy (also known as Kullback-Leibler divergence) is a measure of difference between two probability distributions P and Q. The option "Number of intervals" specifies how many bins should be used for the discrete probability distributions.

This option works only with the "Background frequencies" option.

Predefined Vector Libraries
Bacteria SwissProt 56.0
Five quantitative descriptors, five bins, background frequencies from Bacteria SwissProt 56.0 (70,064,437 residues).
Eukaryota SwissProt 56.0
Five quantitative descriptors, five bins, background frequencies from Eukaryota SwissProt 56.0 (61,250,470 residues).
Flaviviruses
Five quantitative descriptors, five bins, background frequencies from 33 Flaviviruses (112,295 residues).
Human SwissProt 56.0
Five quantitative descriptors, five bins, background frequencies from Human SwissProt 56.0 (11,101,461 residues).
SwissProt 40.0
Five quantitative descriptors, five bins, background frequencies from SwissProt 40.0 (37,308,249 residues). Venkatarajan, M.S., Braun W., 2001, J. Mol. Modeling 7:445-453.
SwissProt 56.0
Five quantitative descriptors, five bins, background frequencies from SwissProt 56.0 (141,206,021 residues).
Viruses SwissProt 56.0
Five quantitative descriptors, five bins, background frequencies from Viruses SwissProt 56.0 (5,657,599 residues).
Quantitative descriptors
The concept of quantitative descriptors makes it possible to represent each of the 20 naturally occurring amino acids as a point in a multidimensional space. PCPMer can work with different descriptors, but currently only one set of these is available through this web interface. If you'd like to try different descriptors as well, please contact us.
Venkatarajan, Braun 2001
Five quantitative descriptors, introduced in Venkatarajan, M.S., Braun W., 2001, J. Mol. Modeling 7:445-453.
The five dimensions roughly correspond to hydrophobicity/hydrophilicity (E1); size (E2); alpha-helix propensity (E3). The property E4 is partially related to the partial specific volume, number of codons and relative abundance of the amino acids; and E5 correlates weakly with beta-strand propensity.
Reference Sequence
On output, Motif Maker will use the specified sequence for the motif definition. If empty, the first sequence of the multiple alignment will be used as the reference sequence.
Remove redundant
Check this option if you wish to use CD-HIT to reduce the number of redundant sequences in the BLAST search results.
Substitutions
Motif Maker can work only with the 20 standard amino acids. If your multiple sequence alignment contains letters other than the standard set (A, C, D, E, F, G, H, I, L, K, P, M, N, Q, R, S, T, V, W, Y, -), you have two choices: either edit your multiple alignment file manually, or tell the program to make the substitutions for you.

The substitutions must be specified as a comma-separated list of records ORI:NEW, for example X:-,B:-,O:-.
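
The ORI:NEW format can be read and applied as sketched below. The function names are hypothetical, and the sketch assumes that both ORI and NEW are single characters.

```python
def parse_substitutions(spec):
    """Turn a string such as 'X:-,B:-,O:-' into a {ORI: NEW} mapping."""
    table = {}
    for record in spec.split(","):
        record = record.strip()
        if not record:
            continue
        ori, sep, new = record.partition(":")
        if not sep or len(ori) != 1 or len(new) != 1:
            raise ValueError(f"malformed substitution record: {record!r}")
        table[ori] = new
    return table

def apply_substitutions(sequence, table):
    """Replace every ORI character in the sequence by its NEW character."""
    return sequence.translate(str.maketrans(table))

table = parse_substitutions("X:-,B:-,O:-")
cleaned = apply_substitutions("AXBLO-", table)
```

Here `cleaned` is "A--L--": the three non-standard letters are turned into gaps and everything else is left alone.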

The PCPgprf file
The PCPgprf file generated by MotifMaker contains statistical information gathered from the multiple sequence alignment supplied by the user. Each line of the file corresponds to one column of the multiple sequence alignment. The number of columns on each line may differ for different sets of quantitative descriptors. In the default case of 5 descriptors, the meaning of the columns is as follows:

  • 1. Column of the multiple sequence alignment. (Indexed from 1.)
  • 2. Number of sequences in the multiple sequence alignment.
  • 3. Number of sequences in the multiple sequence alignment.
  • 4. The line of the .gprf file (indexed from 1). Corresponds to the index of the reference sequence (when gaps absent) or of the alignment column (when gaps present).
  • 5. Amino acid of the reference sequence.
  • 6-10. Average values of the quantitative descriptors for the given column of the multiple sequence alignment.
  • 11-15. Standard deviations of the quantitative descriptors.
  • 16-20. Relative entropies of the quantitative descriptors.
  • 21. Total entropy.
  • 22. Similarity.
  • 23. Physicochemical distance.

Example:

  1  25  25   1 T  -2.6647   2.6822  -1.0055  -0.0089  -0.6522   9.8122   4.7013   2.7967   2.5571   3.1612   0.0280   0.1050   0.0365   0.1376   0.0283   0.1206   0.2222  14.6333
  2  25  25   2 R   8.1694  -5.1435   2.0387  -1.2167  -3.1891   3.4569   5.4552   1.2068   3.3977   2.4961   0.1353   0.1069   0.2481   0.0739   0.1943   0.2727   0.3943   8.8977
  3  25  25   3 C  -5.2001   5.3389   1.5350   9.0179  -5.5305   3.2743   2.4343   0.5373   2.5872   1.8762   0.6186   0.1592   0.3585   0.8598   0.3071   0.8280   0.7964   1.8684
  4  25  25   4 T  -6.4642   2.2470  -3.3837  -2.6776  -1.8299   8.2224   3.0542   2.4973   2.0986   3.4671   0.1026   0.1244   0.1248   0.0416   0.0850   0.1720   0.3010  12.0053
  5  25  25   5 H   6.1173   6.4708   0.1283   1.4389   0.2912   5.8624   8.9879   0.8977   2.2588   1.9235   0.1024   0.1548   0.1552   0.1688   0.0819   0.2384   0.2685  13.1477
  6  25  25   6 L -11.1182   0.7667  -3.4099  -1.5254   0.1100   5.2309   3.1290   3.1153   3.9061   3.0940   0.2096   0.1117   0.1696   0.1342   0.0179   0.2312   0.3700   9.9420
  ....
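
A line of the default 23-column layout can be split into named fields as sketched below. The record type `GprfRecord` and its field names are hypothetical; columns 2 and 3 are both described above as the number of sequences, so the sketch simply keeps them as two separate fields.

```python
from typing import NamedTuple, Tuple

class GprfRecord(NamedTuple):
    column: int                     # 1: alignment column
    n_sequences: int                # 2: number of sequences
    n_sequences_2: int              # 3: number of sequences (as documented)
    line: int                       # 4: line of the file
    residue: str                    # 5: reference amino acid
    averages: Tuple[float, ...]     # 6-10
    deviations: Tuple[float, ...]   # 11-15
    entropies: Tuple[float, ...]    # 16-20
    total_entropy: float            # 21
    similarity: float               # 22
    distance: float                 # 23

def parse_gprf_line(line):
    """Parse one whitespace-separated line of the 5-descriptor layout."""
    fields = line.split()
    if len(fields) != 23:
        raise ValueError(f"expected 23 fields, found {len(fields)}")
    nums = [float(x) for x in fields[5:]]
    return GprfRecord(
        int(fields[0]), int(fields[1]), int(fields[2]), int(fields[3]), fields[4],
        tuple(nums[0:5]), tuple(nums[5:10]), tuple(nums[10:15]),
        nums[15], nums[16], nums[17],
    )

example = ("  3  25  25   3 C  -5.2001   5.3389   1.5350   9.0179  -5.5305"
           "   3.2743   2.4343   0.5373   2.5872   1.8762"
           "   0.6186   0.1592   0.3585   0.8598   0.3071"
           "   0.8280   0.7964   1.8684")
record = parse_gprf_line(example)
```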
Total Entropy Method
Motif Maker calculates five relative entropy values for each position (column) of the multiple sequence alignment. These five entropies correspond to the five quantitative descriptors.

In the Maximum Value approach, a position is recognised as significant if at least one of the five entropies exceeds a given threshold.

In the Unscaled Sum approach, the sum of the five entropy values is taken as the total entropy.

In the Scaled Sum approach, the five entropy values are scaled to yield a "uniform" space, and then their sum is taken as the total entropy.
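
The three approaches can be compared side by side in a small sketch. The function name and the `scale` argument are assumptions: the scaling factors used by Motif Maker are not given here, so they must be supplied by the caller, and the sketch does not apply any normalization of the result to the 0-1 interval.

```python
def total_entropy(entropies, method="max", scale=None):
    """Combine the five relative entropies of one column into a single value."""
    if method == "max":
        # a threshold on the maximum is met when any single entropy exceeds it
        return max(entropies)
    if method == "sum":
        return sum(entropies)
    if method == "scaled_sum":
        if scale is None:
            raise ValueError("scaled_sum needs one scaling factor per entropy")
        return sum(w * e for w, e in zip(scale, entropies))
    raise ValueError(f"unknown method: {method!r}")

entropies = (0.6186, 0.1592, 0.3585, 0.8598, 0.3071)  # E1..E5 of one column
by_max = total_entropy(entropies, "max")
by_sum = total_entropy(entropies, "sum")
# Hypothetical scaling factors, chosen only to show the call.
by_scaled = total_entropy(entropies, "scaled_sum", scale=(1.0, 0.5, 0.5, 1.0, 0.5))
```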

NOTE: This parameter has no effect on calculations based on similarity. (See the Motifs based on option above in Motif Maker or the Variability plot based on option above in 3D Variability.)

Variability Plot - Reference Sequence
Fill in the name of one of the sequences from the multiple sequence alignment. This sequence will be used to map residues of the PDB file to the corresponding columns of the multiple alignment.
Variability color scale
By default, an absolute coloring scheme is used, based on the values of the color scale parameters. For example, blue corresponds to a similarity value of 1.0, white to a similarity of 0.5, and red to a similarity of 0. The Use Floating Colors option must be unchecked for this.

When the Use Floating Colors option is checked, the blue (resp. white or red) color will correspond to the largest (resp. average or least) similarity value found in the multiple sequence alignment.
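
The difference between the two schemes can be sketched as a mapping from similarity to an RGB colour. The function names are hypothetical, and linear interpolation between the three anchor colours is an assumption; the text above only fixes the colours at the low, middle and high points.

```python
def similarity_color(value, low=0.0, mid=0.5, high=1.0):
    """Map a similarity to (r, g, b): red at `low`, white at `mid`, blue at `high`."""
    value = min(max(value, low), high)  # clamp to the scale
    if value >= mid:
        t = (value - mid) / (high - mid) if high > mid else 0.0
        c = round(255 * (1 - t))        # fade from white to blue
        return (c, c, 255)
    t = (mid - value) / (mid - low) if mid > low else 0.0
    c = round(255 * (1 - t))            # fade from white to red
    return (255, c, c)

def floating_scale(values):
    """Anchor points for floating colors: least, average and largest similarity."""
    return min(values), sum(values) / len(values), max(values)

similarities = [0.2, 0.4, 0.9]
absolute = [similarity_color(v) for v in similarities]
low, mid, high = floating_scale(similarities)
floating = [similarity_color(v, low, mid, high) for v in similarities]
```

With the absolute scheme, a similarity of 0.9 is a slightly pale blue; with floating colors it becomes pure blue because it is the largest value in this alignment.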