View of /FigWebPages/similarities_options.html

Revision 1.4 - (download) (as text) (annotate)
Mon Mar 21 21:39:32 2005 UTC (14 years, 6 months ago) by golsen
Branch: MAIN
Changes since 1.3: +199 -160 lines
Add options to Similarities to override Max expand when asking for FIG IDs.
It gets annoying to have to constantly increase the value, and it is not
predictable how many sims will need to be expanded to get any desired
number of FIG IDs.

<HTML>
<HEAD>
<TITLE>Explanation of Protein Similarities Options</TITLE>
</HEAD>

<BODY>
<H1><Center>Explanation of Protein Similarities Options</Center></H1>

The SEED has numerous (some might say too many) options for controlling
the display of similarities.  Some are fairly obvious, while others are
less so.  Learning to use the options allows you to do things that
are not easily done using other tools.

<H2>Background:  Sequences and similarities in the SEED</H2>

Some of the options relate to how the SEED stores sequences and similarities.
The system is in many ways similar to that employed by NCBI in its
non-redundant BLAST databases.  The primary idea is that the SEED protein
database is a merger of sequences from several different sources.  It is
common to have the same sequence from some combination of GenBank, RefSeq,
UniProt, and KEGG, as well as the SEED genomes.  In the NCBI non-redundant
databases, identical sequences are represented by a single sequence entry and
a list of all the different sources. The SEED carries this one step further. 
In the SEED, the sequences do not need to be identical.  A sequence that is
identical to the carboxy-terminal portion of another sequence can be merged
into the entry for the longer sequence.  The most common reason for this to
happen is that the sequences are based upon the same gene, but the assumed
start site is different for the entries.  In instances of very closely
related organisms, it is also possible for proteins to be merged between
genomes.  With the sequencing of multiple strains within a bacterial species,
this is now a fairly common occurrence.
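<p>
The merging rule described above can be sketched in a few lines. This is only an illustration of the idea, not the SEED's actual code: the function name, the dictionary input, and the example identifiers are all assumptions made for the example.

```python
def merge_by_c_terminus(sequences):
    """Group sequences under the longest entry that ends with them.

    `sequences` maps an identifier to a protein string (hypothetical input
    format). Returns a dict of representative id -> list of member ids.
    """
    # Visit the longest sequences first so that representatives exist
    # before the shorter sequences that could merge into them.
    by_length = sorted(sequences, key=lambda k: len(sequences[k]), reverse=True)
    groups = {}
    for ident in by_length:
        seq = sequences[ident]
        for rep in groups:
            # An identical sequence, or one identical to the
            # carboxy-terminal portion of the representative, is merged.
            if sequences[rep].endswith(seq):
                groups[rep].append(ident)
                break
        else:
            # No existing representative ends with this sequence.
            groups[ident] = [ident]
    return groups
```

In this sketch the representative is always the longest member of its group, which matches the description of the representative as the longest version of each protein.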

<p>
So, what does this mean to similarities?  The similarities are computed only
for the representative (longest) version of each protein.  The results are
automatically adjusted to look like those that would be found for the
user-requested query.  However, the SEED has no way of knowing which subject
sequence (matching sequence) will be of greatest interest to the user.  On
one hand, listing all equivalent sequences is time consuming, and clutters
the list with multiple versions of each match.  On the other hand, the
description associated with the representative sequence might be less useful
than that associated with other entries.  Perhaps the most annoying issue is
that the entry for the SEED genome (the FIG sequence) is often not the
representative, and hence is not visible without expanding the list to show
all equivalent sequences.  A less obvious consequence is that without
expanding the list of matches, proteins identical to the query in the same or
other genomes will not be included in the table of <b><font
color=blue>Similarities</font></b>! (Of course they are already displayed as
"<b><font color=blue>Assignments for Essentially Identical
Proteins</font></b>", but you might not have realized that this is what that
list represents.)

<Small><BLOCKQUOTE>
There are potentially unexpected, or even undesired consequences of this, but
it is often the case that closer consideration suggests that they are mostly
harmless, or perhaps even blessings in disguise.  In particular, the reported
region of similarity can have a negative start coordinate, because the
similarity does extend beyond the reported start of the particular protein. 
It is also the case that similarity scores are not adjusted for shorter
sequences that might not include the entire region of similarity (again, this
is only the case when the reported start point is a negative sequence
position).
</BLOCKQUOTE></Small>
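<p>
To see where a negative start coordinate comes from, consider a similarity computed against the representative and then reported for a shorter equivalent sequence. The sketch below assumes that the shorter sequence matches the carboxy-terminal end of the representative and that the adjustment is a simple subtraction. The function and parameter names are hypothetical, and the SEED's exact arithmetic may differ.

```python
def adjust_region(rep_start, rep_end, rep_length, member_length):
    """Translate a match region from representative coordinates to the
    coordinates of a shorter, carboxy-terminally identical sequence."""
    # Number of extra residues the representative carries at its
    # amino-terminal end (assumed alignment of the two sequences).
    offset = rep_length - member_length
    return rep_start - offset, rep_end - offset
```

For example, if the representative is 320 residues long and the requested sequence is 300, a match that begins at position 5 of the representative is reported as beginning at position -15, because the similarity starts before the annotated start of the shorter protein.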

<H2>Standard Options</H2>

<b><font color=blue>Max sims</font></b> is the number of similarities to
report.  This is the number of entries in the table of similarities, not
necessarily the number of unique sequences that were "expanded".  The number
reported can be less than this if there are fewer entries in the database that
satisfy all of the other criteria defined by the search options (by default,
the only limit is the E-value of the match).

<p>
<b><font color=blue>Max expand</font></b> lets the user control the expansion
of the representative database sequence (which in this context should be
viewed as arbitrarily chosen) into the list of equivalent sequences. 
Expanding is essential for two functions: Showing just FIG sequences and
showing sequences that are identical to the query. Beyond these two cases,
expanding at least some sequences is frequently useful for seeing what diverse
databases (<i>e.g.</i>, UniProt, KEGG and RefSeq) have to say about the most
significant matches.  Sometimes it appears that fewer entries were expanded
because one or more expanded entries were filtered out by later tests
(<i>e.g.</i>, they were environmental sequences, or they did not have a FIG
ID).  Regardless of the value of <b><font color=blue>Max expand</font></b>,
the process stops as soon as <b><font color=blue>Max sims</font></b>
similarities have been reported.

<p>
<b><font color=blue>Max E-val</font></b> sets an upper limit on the E-value
(the expected number of random database matches this good or better) of
matches that will be displayed (that is, a lower limit on the significance). 
The highest E-value actually reported is never greater than this value, but
can be less due to the E-value cutoff used in computing the original
similarities, and/or the limited number of similarities stored in the
database.

<p>
There is a pop-up menu to select the treatment of entries when they are
expanded (and even to force the expanding of additional entries):

<ul>

<li><b><font color=blue>Show all databases</font></b> displays information for
all equivalent sequences for the first <b><font color=blue>Max
expand</font></b> entries that satisfy all of the other criteria. For any
additional matches, up to a total of <b><font color=blue>Max sims</font></b>,
just the information for the "representative" sequence is shown.<p>

<li><b><font color=blue>Prefer FIG IDs (to max exp)</font></b> displays all
FIG IDs for the first <b><font color=blue>Max expand</font></b> entries that
satisfy all of the other criteria.  If a set of equivalent sequences does not
include any with a FIG ID, the information for the "representative" sequence
is shown.  For any additional matches, up to a total of <b><font
color=blue>Max sims</font></b>, the information for the "representative"
sequence is shown.<p>

<li><b><font color=blue>Prefer FIG IDs (all)</font></b> displays all available
FIG IDs for entries that satisfy all of the other criteria.  If a set of
equivalent sequences does not include any with a FIG ID, the information for
the "representative" sequence is shown.  This option ignores <b><font
color=blue>Max expand</font></b>, and hence may take more time.<p>

<li><b><font color=blue>Just FIG IDs (to max exp)</font></b> filters the
expanded similarities for entries with FIG IDs (genomes in the SEED).  This
only affects expanded entries, so if the number of FIG IDs found in <b><font
color=blue>Max expand</font></b> entries is less than <b><font color=blue>Max
sims</font></b>, and there are additional similarities that satisfy all other
criteria, the information for the "representative" sequence is shown.<p>

<li><b><font color=blue>Just FIG IDs (all)</font></b> filters the expanded
similarities for entries with FIG IDs (genomes in the SEED).  This option
ignores <b><font color=blue>Max expand</font></b>, and hence may take more
time.

</ul>
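<p>
The interplay of Max sims, Max expand, and the five menu choices can be summarized in a small sketch. This is a reading of the descriptions above, not the SEED's implementation: the mode names, the <code>Match</code> class, and the test for a FIG ID (an identifier beginning with <code>fig|</code>) are assumptions made for the example, and the matches are assumed to have already passed all other criteria.

```python
from dataclasses import dataclass, field


@dataclass
class Match:
    representative: str                               # id of the representative sequence
    equivalents: list = field(default_factory=list)   # all ids, representative included


def is_fig_id(ident):
    # Assumption for this sketch: FIG IDs start with "fig|".
    return ident.startswith("fig|")


def select_rows(matches, max_sims, max_expand, mode):
    """Return the ids that would fill the similarities table.

    mode is one of the hypothetical names: "show_all", "prefer_fig",
    "prefer_fig_all", "just_fig", "just_fig_all".
    """
    ignore_limit = mode in ("prefer_fig_all", "just_fig_all")
    rows = []
    for n, match in enumerate(matches):
        if len(rows) >= max_sims:
            break
        if ignore_limit or n < max_expand:
            figs = [i for i in match.equivalents if is_fig_id(i)]
            if mode == "show_all":
                chosen = match.equivalents
            elif mode.startswith("prefer_fig"):
                # Fall back to the representative when no FIG ID exists.
                chosen = figs or [match.representative]
            else:
                # "Just FIG IDs": expanded entries without a FIG ID vanish.
                chosen = figs
        else:
            # Beyond Max expand, only the representative is shown.
            chosen = [match.representative]
        rows.extend(chosen)
    # The table never holds more than Max sims entries.
    return rows[:max_sims]
```

Note how, in this sketch, the "to max exp" version of Just FIG IDs can still return representatives without FIG IDs once Max expand entries have been processed, which is the behavior described in the list above.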

<p>
<b><font color=blue>Show Env. samples</font></b> is used to enable the
reporting of similarities to environmental sequences.  By default,
environmental sequences are not reported because all of their annotations are
indirect.  However, they may be displayed so that the user can annotate them,
or explore other properties, such as genomic context.

<p>
<b><font color=blue>Hide aliases</font></b> removes the aliases column from
the similarities table, primarily
to save screen real estate.

<p>
<b><font color=blue>Sort by</font></b>:

<ul>

<li><b><font color=blue>score</font></b> is the default option.  This is
essentially sorting by the significance of the match, but avoids the problem
that E-values all become 0 when there are highly similar, large proteins. 
This is equivalent to the behavior of the BLAST programs.<p>

<li><b><font color=blue>percent identity*</font></b> and <b><font
color=blue>percent identity</font></b> are intuitive ways to find the most
similar sequences when some of them might be partial.  With the bit score
order, partial sequences are near the end of the similarities list, even when
they are nearly identical to the query sequence.  This can be very important
for finding homologs in partial genome sequences.<p>

<li>bit <b><font color=blue>score per position*</font></b> and bit <b><font
color=blue>score per position</font></b> offer another way to find the most
similar sequences when some of them might be partial. As sequences become
diverged, the bit score better captures the residual similarity by giving a
positive score to common amino acid replacements.<p>

For the above two options, the difference in the <b><font
color=blue>*</font></b> version (relative to the non-* version) is a small
sample penalty so that very short sequences will be less apt to randomly
appear very early in the list.  As the order in the menu might suggest, the
<b><font color=blue>*</font></b> versions are recommended, though details of
their behavior might change in the future.

</ul>
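<p>
The effect of the different sort keys is easiest to see with numbers. The sketch below is illustrative only. The exact form of the small sample penalty is not given here, so the shrinkage used in <code>penalized</code>, its <code>pseudo_length</code> parameter, and the use of the aligned length as the number of positions are all assumptions chosen to show the general behavior.

```python
def score_per_position(bit_score, aligned_length):
    # Bit score divided by the number of aligned positions.
    return bit_score / aligned_length


def penalized(value, aligned_length, pseudo_length=10.0):
    # Hypothetical small sample penalty: shrink the value for short
    # alignments. This is NOT the SEED's formula, only a stand-in.
    return value * aligned_length / (aligned_length + pseudo_length)


def sort_hits(hits, key):
    """hits: list of (name, bit_score, aligned_length) tuples."""
    if key == "score":
        rank = lambda h: h[1]
    elif key == "score_per_position":
        rank = lambda h: score_per_position(h[1], h[2])
    elif key == "score_per_position*":
        rank = lambda h: penalized(score_per_position(h[1], h[2]), h[2])
    else:
        raise ValueError("unknown sort key: " + key)
    # Best hits first.
    return [h[0] for h in sorted(hits, key=rank, reverse=True)]
```

With a full-length hit, a nearly identical partial sequence, and a very short fragment, plain bit score puts the partial sequences last, score per position puts the short fragment first, and the penalized version moves the fragment back down while keeping the good partial match near the top.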

<p>
<b><font color=blue>Group by genome</font></b> is a useful method to collect
paralogs within a genome to the same location in the similarities list. 
There are several properties of the function that need to be understood to
use it safely and effectively:

<ul>

<li>The same set of similarities is displayed whether this option is
selected or not.  That is, no special attempt is made to find and include
less significant matches in a genome than would otherwise be displayed (most
commonly limited by <b><font color=blue>Max sims</font></b>, or by <b><font
color=blue>Max E-val</font></b>).  To maximize your chance of seeing paralogs,
make <b><font color=blue>Max sims</font></b> a very large number, such as 500
or 1000.<p>

<li>The ability to identify the genome requires the FIG ID, so it is
necessary to expand the entries.  Again, this means a large value of <b><font
color=blue>Max expand</font></b>.<p>

<li>When expanding a large number of entries, you might want to use <b><font
color=blue>Just FIG IDs</font></b>.<p>

<li>Even with all these options, in large gene families (<i>e.g.</i>,
Translation elongation factor Tu), the limits on the number of similarities
in the database can limit the finding of paralogs.<p>

<li><b><font color=red>Be very aware</font></b> that less significant matches
are now intermixed with more significant matches.  <b><font color=red>Do
not</font></b> mechanically run down the list and assign all of the matches
at the top of the list the same function!<p>

<li>On the other hand, this is an excellent mode for selecting the top entries
and making a tree that shows the paralog families!

</ul>
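<p>
A minimal sketch of grouping by genome shows why the warning above matters. It assumes that FIG IDs have the form <code>fig|genome.peg.number</code> with a two-part numeric genome identifier, and that entries without a FIG ID are left as groups of their own. The function names are hypothetical and the SEED's own ordering rules may differ.

```python
import re

# Assumed FIG ID layout: fig|<taxon>.<version>.<feature type>.<number>
FIG_ID = re.compile(r"^fig\|(\d+\.\d+)\.")


def genome_of(ident):
    found = FIG_ID.match(ident)
    return found.group(1) if found else None


def group_by_genome(hits):
    """hits: list of (id, score) already sorted best first.

    Returns the same hits, with every hit from a genome moved up to the
    position of that genome's best hit.
    """
    groups = {}
    for position, (ident, score) in enumerate(hits):
        # Entries with no recognizable genome get a unique key so they
        # stay as single-member groups.
        key = genome_of(ident) or ("ungrouped", position)
        groups.setdefault(key, []).append((ident, score))
    # Dictionaries keep insertion order, so groups appear in the order
    # of their best hit.
    return [hit for members in groups.values() for hit in members]
```

In the grouped output a weak paralog can appear directly after the best hit from its genome and ahead of much stronger matches from other genomes, so the position in the list no longer reflects significance.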

<H2>Standard Buttons</H2>

The <b><font color=blue>resubmit</font></b> button is used to redisplay the
similarities after changing one or more of the options.  This is also the
action taken if you press the return key while the cursor is in one of the
text boxes.

<p>
The <b><font color=blue>more similarities</font></b> button acts the same as
the resubmit button, but it also doubles the values of Max sims and Max
expand.

<p>
The <b><font color=blue>previous PEG</font></b> button (when present)
navigates to the next lower numbered protein coding gene in the genome.  This
is not necessarily in the same contig, and in some genomes, it can be
anywhere.  This operation conserves all of the similarities settings (unlike
navigating via the genome context table, or map).

<p>
The <b><font color=blue>next PEG</font></b> button (when present) navigates
to the next higher numbered protein coding gene in the genome.  This is not
necessarily in the same contig, and in some genomes, it can be anywhere. 
This operation conserves all of the similarities settings (unlike navigating
via the genome context table, or map).

<p>
The <b><font color=blue>more sim options</font></b> button does what it says:
it gives you more options.

<H2>Extra Options</H2>

The <b><font color=blue>more sim options</font></b> button enables additional
options, with their default values. Hiding the options with the <b><font
color=blue>fewer sim options</font></b> button reverts all of the extra
options to their default values.  That is, the display is never influenced by
options that are hidden from view.

<p>
<b><font color=blue>Min similarity</font></b> defines another way to cut off
weak matches, in this case by percent <b><font
color=blue>identity</font></b> (as reported by BLAST, which comes with
caveats) or by bit <b><font color=blue>score per position</font></b>. 
Although the latter is less intuitive than percent identity, it is probably
better for highly diverged sequences in that the most common amino acid
replacements still get a positive score.  The effective range of this measure
is 0 to 2 bits.  The measurement option is selected with the <b><font
color=blue>as defined by</font></b> pop-up menu.

<p>
<b><font color=blue>Min query cover (%)</font></b> is used to eliminate
matches that only cover a small part of the query.  Typically these are
matches to conserved domains, but they can also be matches to fragmentary
genes in the database.

<p>
<b><font color=blue>Min subject cover (%)</font></b> is used to eliminate
matches that only cover a small part of the database sequence.  Typically
these are matches to conserved domains, but they can also be matches to
multifunctional genes in the database.
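<p>
The three extra filters can be pictured as simple tests on each match. The sketch below is one plausible reading, under these assumptions: coverage is the aligned span divided by the sequence length, score per position uses the aligned span on the query, and all class, field, and parameter names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    percent_identity: float
    bit_score: float
    query_start: int
    query_end: int
    query_length: int
    subject_start: int
    subject_end: int
    subject_length: int


def percent_cover(start, end, length):
    # Inclusive coordinates, so a match from 1 to length covers 100%.
    return 100.0 * (end - start + 1) / length


def passes(hit, min_similarity=0.0, measure="identity",
           min_query_cover=0.0, min_subject_cover=0.0):
    if measure == "identity":
        similarity = hit.percent_identity
    else:
        # Bit score per aligned query position (assumed definition).
        aligned = hit.query_end - hit.query_start + 1
        similarity = hit.bit_score / aligned
    query_cover = percent_cover(hit.query_start, hit.query_end, hit.query_length)
    subject_cover = percent_cover(hit.subject_start, hit.subject_end, hit.subject_length)
    return (similarity >= min_similarity
            and query_cover >= min_query_cover
            and subject_cover >= min_subject_cover)
```

A match to a single conserved domain covers little of the query and fails a query cover threshold, while a match to a much longer multifunctional protein covers the query well but fails a subject cover threshold.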

</BODY>
</HTML>
