Help
- Amino Acid Colors
- Background probability distribution
- Blast Expectation Value
- Create New Motif
- Cut-offs
- Gap length
- Include Gaps
- Input Sequences
- Minimum size
- Motif Types
- Motif from a Profile Substring
- MotifMaker Result Page
- MotifSearch Database
- MotifSearch Result Page
- MotifSearch Score Filter
- MotifSearch Significance Threshold
- MotifSearch input data
- Multiple Alignment
- Number of intervals
- Predefined Vector Libraries
- Quantitative descriptors
- Reference Sequence
- Remove redundant
- Substitutions
- The PCPgprf file
- Total Entropy Method
- Variability Plot - Reference Sequence
- Variability color scale
Acidic
Polar
Nonpolar
The uploaded file should contain only the one-letter amino acid codes, newlines and comment lines starting with ">", such as
ALYEDPPDHKTSPSGKP
VLYEDPPDQKTSPSGKS
VLYEDPPDQKTSASGKS
or
>APE_H._sapiens
ALYEDPPDHKTSPSGKP
>APE_M._musculus
VLYEDPPDQKTSPSGKS
>APE_R._norvegicus
VLYEDPPDQKTSASGKS
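The input rules above can be checked before uploading. The following is only a minimal sketch, not PCPMer's own parser: the function name `read_sequences` is hypothetical, the 20 standard one-letter codes are assumed to be the only valid residues, and sequence lines between two ">" lines are assumed to belong to one sequence.

```python
# Assumption: only the 20 standard amino acid codes are accepted.
VALID_CODES = set("ACDEFGHIKLMNPQRSTVWY")


def read_sequences(text):
    """Return the sequences found in plain or '>'-commented input text."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    has_headers = any(ln.startswith(">") for ln in lines)
    sequences = []
    for ln in lines:
        if ln.startswith(">"):
            sequences.append("")  # a comment line starts a new sequence
            continue
        bad = set(ln.upper()) - VALID_CODES
        if bad:
            raise ValueError(f"unexpected characters: {sorted(bad)}")
        if has_headers and sequences:
            sequences[-1] += ln.upper()  # continue the current sequence
        else:
            sequences.append(ln.upper())  # plain input: one sequence per line
    return [s for s in sequences if s]


plain = "ALYEDPPDHKTSPSGKP\nVLYEDPPDQKTSPSGKS\nVLYEDPPDQKTSASGKS\n"
print(read_sequences(plain))
```

Both input styles shown above should yield the same three sequences with this sketch.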
If unchecked, the default background frequencies will be used.
Setting the value to 0.001 will produce a more homogeneous group of sequences; raising it to 10.0 will produce a more heterogeneous group.
Multiple motifs (strings) can be given, separated by spaces. The strings can contain dots, which mark insignificant residues.
- Significant Residues Motifs / Maximum Value Approach
- In the maximum-value approach, the total entropy of 100% conserved Leu or Val is 0.394 and the total entropy of 100% conserved Cys is 1.000.
- Significant Residues Motifs / Unscaled Sum Approach
- In the unscaled-sum approach, the total entropy of 100% conserved Leu is 0.497 and the total entropy of 100% conserved Cys is 1.000.
- Significant Residues Motifs / Scaled Sum Approach
- In the scaled sum approach, the total entropy of 100% conserved Leu is 0.545 and the total entropy of 100% conserved Tyr is 1.000.
- Similar Residues Motifs
- The similarity value is 1.000 for absolutely conserved positions and 0.000 for very variable positions. The scale is not linear; reasonably conserved positions have values greater than or equal to 0.500.
- Copy and paste your sequence of interest in the form and use the run BLAST search option to find sequences related to your sequence.
- Put your sequences into a ClustalW input file (i.e. Fasta format) and use the upload file option.
- Use the download file option to automagically connect your web service
with PCPMer. (See, for example, the "PCPMer analysis" type of display in the
Flavitrack database.)
http://landau.utmb.edu:8080/pcpmer/Tools/alignments.jsp?alignmentDownload=LINK_TO_THE_ALIGNMENT
- Significant Residues Motifs
- PCPMer traditionally recognised significant columns
of a multiple sequence alignment by means of relative entropy,
also known as the Kullback-Leibler divergence.
This approach identifies positions whose residue distribution has a small
probability of spontaneous occurrence, whether or not the residues are
conserved across the column of the multiple sequence alignment. For example,
a column containing the Glu and Pro residues in the ratio 1:1 will be
identified as highly unlikely and therefore significant, although
the two amino acids are very different. Conversely, a column
rich in Leu and Ile residues has a small relative entropy, although
its physicochemical properties are well conserved, because
these residues are very abundant.
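The effect of residue abundance on the relative entropy can be illustrated with a small calculation. This is a sketch under stated assumptions, not PCPMer's code: the five-bin background frequencies are invented for illustration, the natural logarithm is assumed, and the function name `relative_entropy` is hypothetical.

```python
import math


def relative_entropy(observed, background):
    """Kullback-Leibler divergence: sum over bins of Q(i) * ln(Q(i) / P(i))."""
    if len(observed) != len(background):
        raise ValueError("distributions must have the same number of bins")
    total = 0.0
    for q, p in zip(observed, background):
        if q > 0.0:  # bins with Q(i) = 0 contribute nothing
            total += q * math.log(q / p)  # assumes P(i) > 0 in every bin
    return total


# Hypothetical background: bin 0 is very common, bin 4 is rare.
background = [0.40, 0.30, 0.15, 0.10, 0.05]

in_common_bin = [1.0, 0.0, 0.0, 0.0, 0.0]  # column conserved in the common bin
in_rare_bin = [0.0, 0.0, 0.0, 0.0, 1.0]    # column conserved in the rare bin

print(relative_entropy(in_common_bin, background))
print(relative_entropy(in_rare_bin, background))
print(relative_entropy(background, background))
```

With these invented numbers, a column conserved in the common bin scores lower than one conserved in the rare bin, and a column that simply reproduces the background scores zero.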
The Kullback-Leibler divergence for the property p is defined as
$$D_{KL}(Q_p \,\|\, P_p) = \sum_{i=1}^{n} Q_p(i) \, \log \frac{Q_p(i)}{P_p(i)}$$
where Q_p are the discrete probability distributions of the five property values observed in the alignment and P_p are the probability distributions of a random sample. In this equation, the index p iterates over the five properties E1 to E5 and the index i over the discrete probability distribution bins (the 20 amino acids were grouped into 5 bins for each vector, so n=5). Thus Q_p(i) is the fraction of the component p observed in the bin i and P_p(i) is the corresponding background frequency.
- Similar Residues Motifs
- An alternative definition of a motif is based on the overall physicochemical
similarity of amino acids across a column of a multiple sequence alignment.
The physicochemical distance between two amino acids i and j
is defined as the Euclidean distance in the five-dimensional space of the property
descriptors E1-E5 (the square-root term in Equation 1 below). The average
physicochemical distance D of a whole column of a multiple alignment is then
defined as the average of all pairs' distances (Equation 1).
In order to display the physicochemical distances using a color scale, a fixed range of values is required. PCPMer therefore displays the similarity values instead of physicochemical distances, as defined in Equation 2. The similarity is equal to 1 for absolutely conserved columns and quickly falls to 0 for more diverse columns. The similarity also contains a term which lowers its value when gaps are present in the column of the alignment.
where V_ip are the five quantitative descriptor values of the i-th amino acid; N is the number of sequences in the multiple alignment; N_no gaps is the number of sequences not containing a gap in the given column; D is the physicochemical distance; and S is the similarity.
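The average pairwise distance of a column can be sketched as follows. This is an illustration only: the descriptor vectors are invented rather than taken from the published E1-E5 tables, gapped sequences are assumed to be left out of the input, and the similarity S is not reproduced. The function name `column_distance` is hypothetical.

```python
import math
from itertools import combinations


def column_distance(vectors):
    """Average Euclidean distance over all pairs of five-dimensional vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0  # fewer than two residues: nothing to compare
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


# Hypothetical descriptor vectors, one per residue in the column.
column = [
    (0.0, 0.0, 0.0, 0.0, 0.0),
    (3.0, 4.0, 0.0, 0.0, 0.0),
    (0.0, 0.0, 0.0, 0.0, 0.0),
]
print(column_distance(column))
```

A column in which every residue has the same descriptor vector gives a distance of 0, in line with the statement that absolutely conserved columns have zero physicochemical distance.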
Multiple substrings can be given, separated by spaces. The strings can contain dots, which mark insignificant residues. Each substring must be unique within the reference sequence.
Note: This dialog is not the same as the Create New Motif dialog.
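The uniqueness requirement for substrings can be checked with a regular expression. This is a minimal sketch, assuming that a dot matches any single residue; the function name `find_motif` and the reference sequence are hypothetical.

```python
import re


def find_motif(motif, reference):
    """Return the 0-based start positions where the motif matches the reference."""
    # Keep dots as wildcards and escape every other character literally.
    pattern = "".join("." if ch == "." else re.escape(ch) for ch in motif)
    # The lookahead makes overlapping matches visible as well.
    return [m.start() for m in re.finditer(f"(?={pattern})", reference)]


reference = "AADRGWGNGCGLFGKGAA"  # hypothetical reference sequence
hits = find_motif("DRGWGNGC..FGKG", reference)
print(hits, "unique" if len(hits) == 1 else "not unique")
```

A substring is usable under this check when exactly one start position is returned.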
- - The shaded areas in the first row indicate recognized motifs.
- - The columns with asterisk (*) indicate absolutely conserved columns.
- - When the mouse cursor is positioned over the one-letter codes of the reference sequence, the index of the residue is displayed. The first number is relative to the reference sequence and the second number is the column of the multiple alignment.
- - The red fields in the "similarity" row correspond to variable columns, green fields correspond to conserved columns. When the mouse cursor is positioned over the fields, the calculated value of similarity will be displayed (a number from the interval 0-1). The similarity of absolutely conserved columns is 1.
- - The gray fields in the rows D1-D5 show the average physicochemical distances for each of the five properties. The dark colors correspond to low distances and therefore to more conserved columns. When the mouse cursor is positioned over the fields, the calculated physicochemical distance will be displayed. The physicochemical distance of absolutely conserved columns is 0.
- - The red fields in the "total E" row correspond to variable columns with low relative entropies, blue fields correspond to columns with high relative entropies. The actual value of relative entropy is displayed when the mouse cursor is positioned over the fields.
- - The gray fields E1-E5 show the relative entropies for each of the five properties. The dark colors correspond to high entropies, light colors to low entropies.
- - The bottom part of the table shows which motifs would be recognized with different cutoff values.
- Search the Astral database and find sequences which contain selected motifs. (Are you interested in other than Astral databases? Let us know!)
- The user can enter a protein sequence, and MotifSearch will determine how well the motifs match the sequence at each position. Multiple sequences in FASTA format may be given.
- - The searched motifs section shows the characteristics of the searched motifs. The fields marked with x indicate which of the five quantitative descriptors exceeded the significance threshold. The unmarked fields were not used when scoring the motif against a sequence.
- - In the "search within a sequence" mode, the motifs are aligned with every position of the sequence and scores are calculated. The score for each position is displayed on a color scale. White indicates low scores, blue represents high scores. In the example below, the blue D indicates that the motif DRGWGNGC..FGKG gives an almost perfect match with the sequence at that position. When the mouse cursor is positioned over the fields, the actual value of the score and the position are displayed. Below, the top five positions are shown in detail and aligned with the motif.
- - In the "search in a database" mode, the motifs are also aligned with every position of each sequence from the database and scores are calculated. However, only the best matching positions are displayed and the total score is calculated. Best scoring sequences are listed first. Note that the total score is used only for determining the relative order of the best matches; its actual value is meaningless and cannot be compared with values obtained in different searches.
The calculation of the total score is based on the score values (Seq) for each motif and on the average scores calculated for each sequence of the multiple alignment (Aln) and for each sequence of the database (DB). If the score filter is turned on, these values decide whether a motif will make a contribution to the total score or not. With the fixed cutoff set, all motifs with a score value (Seq) smaller than the cutoff will be ignored. With the mean cutoff set, all motifs with a score value (Seq) smaller than (Aln + DB)/2 will be ignored.
The motifs colored red have been evaluated as random, non-significant matches, because their score is within two standard deviations of the average score (Seq < DB_avg + 2 × DB_dev). That is, with this level of accuracy, one can find very many motifs. The motifs colored black deviate somewhat from the standard distribution in the database and the motifs colored green deviate significantly. Note that the color code is for basic orientation only. It is up to the user to decide which hits can be considered biologically meaningful.
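The cutoff rules and the red-color test can be written as two small predicates. This is a sketch of the stated inequalities only; the function names and the mode labels "off", "fixed" and "mean" are hypothetical, and the way PCPMer combines the surviving scores into the total score is not modeled.

```python
def contributes(seq, mode, cutoff=None, aln=None, db=None):
    """Decide whether a motif score (Seq) contributes to the total score."""
    if mode == "off":
        return True
    if mode == "fixed":
        return seq >= cutoff          # scores below the cutoff are ignored
    if mode == "mean":
        return seq >= (aln + db) / 2  # scores below (Aln + DB)/2 are ignored
    raise ValueError(f"unknown filter mode: {mode}")


def is_random_match(seq, db_avg, db_dev):
    """True when the score lies below DB_avg + 2 * DB_dev (shown in red)."""
    return seq < db_avg + 2 * db_dev


print(contributes(0.62, "fixed", cutoff=0.70))
print(contributes(0.62, "mean", aln=0.80, db=0.40))
print(is_random_match(0.62, db_avg=0.45, db_dev=0.10))
```

The example scores are invented; only the comparisons themselves come from the description above.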
When the mouse cursor is positioned over the fields, the position of the amino acid in the reference sequence will be displayed. When the icon in the "Description" column is clicked, the sequence is displayed without aligned motifs, a feature useful for copying the sequence to the clipboard and pasting it elsewhere.
Choose one of the Fixed Cutoff or Exclude non-matching options to filter non-significant motifs from the results, individually for each sequence.
Detailed explanation:
MotifSearch walks through a protein database and calculates, for each of the sequences, a total score for the given set of motifs. The total score for a given sequence is based on "similarity" scores calculated for each of the motifs. The similarity score is a value between 0 and 1, where the extreme value of 1 indicates that the sequence contains a perfect match for the motif.
The Score Filter lets you drop non-significant matches from the total score evaluation (and from the display). The Fixed Cutoff option filters out those contributions to the total score that are smaller than the given threshold. The Exclude non-matching option does the same, but sets the cutoff dynamically, for each motif separately. The dynamic thresholds are set to two standard deviations from the average score value, calculated for the whole protein database. In other words, although a motif may have a rather low score for the given sequence, it may still be identified as important as long as it scores higher for the given sequence than for the other sequences in the database.
Standard ClustalW output:
Clustal 2.0

APE_H._sapiens      ALYEDPPDHKT
APE_M._musculus     VLYEDPPDQKT
APE_R._norvegicus   VLYEDPPDQKT

APE_H._sapiens      SPSGKP-----
APE_M._musculus     SPSGKS-----
APE_R._norvegicus   SASGKS-----
or each sequence on one line:
APE_H._sapiens      ALYEDPPDHKTSPSGKP-----
APE_M._musculus     VLYEDPPDQKTSPSGKS-----
APE_R._norvegicus   VLYEDPPDQKTSASGKS-----
or PIR format:
>APE_H._sapiens
ALYEDPPDHKTSPSGKP-----
>APE_M._musculus
VLYEDPPDQKTSPSGKS-----
>APE_R._norvegicus
VLYEDPPDQKTSASGKS-----
This option works only with the "Background frequencies" option.
- Bacteria SwissProt 56.0
- Five quantitative descriptors, five bins, background frequencies from Bacteria SwissProt 56.0 (70,064,437 residues).
- Eukaryota SwissProt 56.0
- Five quantitative descriptors, five bins, background frequencies from Eukaryota SwissProt 56.0 (61,250,470 residues).
- Flaviviruses
- Five quantitative descriptors, five bins, background frequencies from 33 Flaviviruses (112,295 residues).
- Human SwissProt 56.0
- Five quantitative descriptors, five bins, background frequencies from Human SwissProt 56.0 (11,101,461 residues).
- SwissProt 40.0
- Five quantitative descriptors, five bins, background frequencies from SwissProt 40.0 (37,308,249 residues). Venkatarajan, M.S., Braun W., 2001, J. Mol. Modeling 7:445-453.
- SwissProt 56.0
- Five quantitative descriptors, five bins, background frequencies from SwissProt 56.0 (141,206,021 residues).
- Viruses SwissProt 56.0
- Five quantitative descriptors, five bins, background frequencies from Viruses SwissProt 56.0 (5,657,599 residues).
- Venkatarajan, Braun 2001
- Five quantitative descriptors, introduced in
Venkatarajan, M.S., Braun W., 2001, J. Mol. Modeling 7:445-453.
The five dimensions roughly correspond to the following properties: E1 to hydrophobicity/hydrophilicity; E2 to size; E3 to alpha-helix propensity; E4, partially, to the partial specific volume, the number of codons and the relative abundance of the amino acids; and E5, weakly, to beta-strand propensity.
The substitutions must be specified as a comma-separated list of ORI:NEW records, for example X:-,B:-,O:-.
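The ORI:NEW record format can be parsed and applied as in the following sketch. It assumes that both ORI and NEW are single characters, which is true of the example but is not stated as a rule; the function names are hypothetical.

```python
def parse_substitutions(spec):
    """Turn a string such as 'X:-,B:-,O:-' into a {ORI: NEW} dictionary."""
    table = {}
    for record in spec.split(","):
        record = record.strip()
        if not record:
            continue  # tolerate a trailing comma
        ori, sep, new = record.partition(":")
        # Assumption: ORI and NEW are exactly one character each.
        if sep != ":" or len(ori) != 1 or len(new) != 1:
            raise ValueError(f"malformed record: {record!r}")
        table[ori] = new
    return table


def apply_substitutions(sequence, table):
    """Replace every ORI character in the sequence by its NEW character."""
    return sequence.translate(str.maketrans(table))


table = parse_substitutions("X:-,B:-,O:-")
print(apply_substitutions("AXBLO", table))
```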
- 1. Column of the multiple sequence alignment. (Indexed from 1.)
- 2. Number of sequences in the multiple sequence alignment.
- 3. Number of sequences in the multiple sequence alignment.
- 4. The line of the .gprf file (indexed from 1). Corresponds to the index of the reference sequence (when gaps absent) or of the alignment column (when gaps present).
- 5. Amino acid of the reference sequence.
- 6-10. Average values of the quantitative descriptors for the given column of the multiple sequence alignment.
- 11-15. Standard deviations of the quantitative descriptors.
- 16-20. Relative entropies of the quantitative descriptors.
- 21. Total entropy.
- 22. Similarity.
- 23. Physicochemical distance.
Example:
1 25 25 1 T -2.6647 2.6822 -1.0055 -0.0089 -0.6522 9.8122 4.7013 2.7967 2.5571 3.1612 0.0280 0.1050 0.0365 0.1376 0.0283 0.1206 0.2222 14.6333
2 25 25 2 R 8.1694 -5.1435 2.0387 -1.2167 -3.1891 3.4569 5.4552 1.2068 3.3977 2.4961 0.1353 0.1069 0.2481 0.0739 0.1943 0.2727 0.3943 8.8977
3 25 25 3 C -5.2001 5.3389 1.5350 9.0179 -5.5305 3.2743 2.4343 0.5373 2.5872 1.8762 0.6186 0.1592 0.3585 0.8598 0.3071 0.8280 0.7964 1.8684
4 25 25 4 T -6.4642 2.2470 -3.3837 -2.6776 -1.8299 8.2224 3.0542 2.4973 2.0986 3.4671 0.1026 0.1244 0.1248 0.0416 0.0850 0.1720 0.3010 12.0053
5 25 25 5 H 6.1173 6.4708 0.1283 1.4389 0.2912 5.8624 8.9879 0.8977 2.2588 1.9235 0.1024 0.1548 0.1552 0.1688 0.0819 0.2384 0.2685 13.1477
6 25 25 6 L -11.1182 0.7667 -3.4099 -1.5254 0.1100 5.2309 3.1290 3.1153 3.9061 3.0940 0.2096 0.1117 0.1696 0.1342 0.0179 0.2312 0.3700 9.9420
....
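A row of the file can be split into the 23 columns listed above with a short parser. This is a sketch, assuming whitespace-separated fields; the record type `GprfRecord` and its field names are hypothetical, and column 3 is kept under a neutral name because the column list describes it in the same words as column 2.

```python
from typing import NamedTuple, Tuple


class GprfRecord(NamedTuple):
    column: int                    # 1: alignment column
    n_sequences: int               # 2: number of sequences
    n_sequences_col3: int          # 3: documented identically to column 2
    line_index: int                # 4: line of the .gprf file
    residue: str                   # 5: reference amino acid
    averages: Tuple[float, ...]    # 6-10: descriptor averages
    std_devs: Tuple[float, ...]    # 11-15: descriptor standard deviations
    entropies: Tuple[float, ...]   # 16-20: relative entropies
    total_entropy: float           # 21
    similarity: float              # 22
    distance: float                # 23: physicochemical distance


def parse_gprf_line(line: str) -> GprfRecord:
    """Parse one whitespace-separated row into a GprfRecord."""
    fields = line.split()
    if len(fields) != 23:
        raise ValueError(f"expected 23 columns, found {len(fields)}")
    numbers = [float(x) for x in fields[5:]]  # the 18 numeric columns 6-23
    return GprfRecord(
        column=int(fields[0]),
        n_sequences=int(fields[1]),
        n_sequences_col3=int(fields[2]),
        line_index=int(fields[3]),
        residue=fields[4],
        averages=tuple(numbers[0:5]),
        std_devs=tuple(numbers[5:10]),
        entropies=tuple(numbers[10:15]),
        total_entropy=numbers[15],
        similarity=numbers[16],
        distance=numbers[17],
    )


row = ("3 25 25 3 C -5.2001 5.3389 1.5350 9.0179 -5.5305 3.2743 2.4343 0.5373 "
       "2.5872 1.8762 0.6186 0.1592 0.3585 0.8598 0.3071 0.8280 0.7964 1.8684")
record = parse_gprf_line(row)
print(record.residue, record.total_entropy, record.similarity, record.distance)
```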
In the Maximum Value approach, a position is recognised as significant when at least one of its five entropies exceeds a given threshold.
In the Unscaled Sum approach, the sum of the five entropy values is taken as the total entropy.
In the Scaled Sum approach, the five entropy values are scaled to yield a "uniform" space, and then their sum is taken as the total entropy.
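The three ways of combining the five entropies can be compared side by side. This is a rough sketch only: the entropy values and scaling factors are invented, the actual scaling factors are not given here, and any normalization PCPMer applies to the reported total entropy is not modeled. All function names are hypothetical.

```python
def max_value_total(entropies):
    """Maximum Value: the largest of the five entropies decides."""
    return max(entropies)


def unscaled_sum_total(entropies):
    """Unscaled Sum: the plain sum of the five entropies."""
    return sum(entropies)


def scaled_sum_total(entropies, scales):
    """Scaled Sum: each entropy is multiplied by a scaling factor first."""
    if len(entropies) != len(scales):
        raise ValueError("one scaling factor per entropy is required")
    return sum(e * s for e, s in zip(entropies, scales))


entropies = [0.10, 0.20, 0.05, 0.40, 0.25]  # hypothetical E1-E5 entropies
scales = [2.0, 1.0, 1.0, 1.0, 1.0]          # hypothetical scaling factors

print(max_value_total(entropies))
print(unscaled_sum_total(entropies))
print(scaled_sum_total(entropies, scales))
```

Comparing `max_value_total(entropies)` with a threshold is equivalent to asking whether any single entropy exceeds that threshold.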
NOTE: This parameter has no effect for calculations based on similarity. (See the option Motifs based on above in Motif Maker or the option Variability plot based on above in 3D Variability.)
When the Use Floating Colors option is checked, the blue (resp. white or red) color will correspond to the largest (resp. average or least) similarity value found in the multiple sequence alignment.
