
Annotation of /FigWebPages/similarities_options.html



Revision 1.4

1 : golsen 1.1 <HTML>
2 :     <HEAD>
3 :     <TITLE>Explanation of Protein Similarities Options</TITLE>
4 :     </HEAD>
5 :    
6 :     <BODY>
7 :     <H1><Center>Explanation of Protein Similarities Options</Center></H1>
8 :    
9 : golsen 1.2 The SEED has numerous (some might say too many) options for controlling
10 : golsen 1.1 the display of similarities. Some are fairly obvious, while others are
11 : golsen 1.3 less so. Learning to use the options allows you to do things that
12 : golsen 1.1 are not easily done using other tools.
13 :    
14 :     <H2>Background: Sequences and similarities in the SEED</H2>
15 :    
16 :     Some of the options relate to how the SEED stores sequences and similarities.
17 : golsen 1.4 The system is in many ways similar to that employed by NCBI in its
18 :     non-redundant BLAST databases. The primary idea is that the SEED protein
19 :     database is a merger of sequences from several different sources. It is
20 :     common to have the same sequence from some combination of GenBank, RefSeq,
21 :     UniProt, and KEGG, as well as the SEED genomes. In the NCBI non-redundant
22 :     databases, identical sequences are represented by a single sequence entry and
23 :     a list of all the different sources. The SEED carries this one step further.
24 :     In the SEED, the sequences do not need to be identical. A sequence that is
25 :     identical to the carboxy-terminal portion of another sequence can be merged
26 :     into the entry for the longer sequence. The most common reason for this to
27 :     happen is that the sequences are based upon the same gene, but the assumed
28 :     start site is different for the entries. In instances of very closely
29 :     related organisms, it is also possible for proteins to be merged between
30 :     genomes. With the sequencing of multiple strains within a bacterial species,
31 :     this is now a fairly common occurrence.
32 : golsen 1.1
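<p>
The merging rule described above can be illustrated with a minimal sketch. It assumes that each entry is just an ID and an amino acid sequence, and that a shorter sequence joins a longer one when it is identical to the carboxy-terminal end of that sequence. The function name, the ID strings, and the quadratic search are illustrative assumptions, not the SEED's actual code.

```python
def merge_into_representatives(entries):
    """Group sequences under the longest sequence they are a C-terminal suffix of.

    entries: dict mapping a (hypothetical) source ID to its amino acid sequence.
    Returns a dict: representative ID -> list of (member ID, offset), where
    offset is the number of N-terminal residues the member lacks.
    """
    # Visit the longest sequences first, so that a representative always
    # exists before its shorter variants are examined.
    ordered = sorted(entries.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    groups = {}
    for seq_id, seq in ordered:
        for rep_id in groups:
            rep_seq = entries[rep_id]
            if rep_seq.endswith(seq):
                groups[rep_id].append((seq_id, len(rep_seq) - len(seq)))
                break
        else:
            # No existing representative ends with this sequence.
            groups[seq_id] = [(seq_id, 0)]
    return groups


example = {
    "gi|2": "MSLMKTAYIAKQR",          # longer version (earlier assumed start)
    "fig|1.1.peg.1": "MKTAYIAKQR",    # same gene, later assumed start
    "sp|X": "MKTAYIAKQR",             # identical copy from another source
    "kegg|z": "MAAAA",                # unrelated sequence
}
merged = merge_into_representatives(example)
```

In this toy example the FIG sequence is not the representative, which is the situation the next paragraphs discuss.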
33 :     <p>
34 :     So, what does this mean for similarities? The similarities are computed only
35 : golsen 1.3 for the representative (longest) version of each protein. The results are
36 :     automatically adjusted to look like those that would be found for the
37 :     user-requested query. However, the SEED has no way of knowing which subject
38 : golsen 1.4 sequence (matching sequence) will be of greatest interest to the user. On
39 :     one hand, listing all equivalent sequences is time-consuming and clutters
40 :     the list with multiple versions of each match. On the other hand, the
41 :     description associated with the representative sequence might be less useful
42 :     than that associated with other entries. Perhaps the most annoying issue is
43 :     that the entry for the SEED genome (the FIG sequence) is often not the
44 :     representative, and hence is not visible without expanding the list to show
45 :     all equivalent sequences. A less obvious consequence is that without
46 :     expanding the list of matches, proteins identical to the query in the same or
47 :     other genomes will not be included in the table of <b><font
48 :     color=blue>Similarities</font></b>! (Of course they are already displayed as
49 :     "<b><font color=blue>Assignments for Essentially Identical
50 :     Proteins</font></b>", but you might not have realized that this is what that
51 :     list represents.)
52 : golsen 1.1
53 : golsen 1.3 <Small><BLOCKQUOTE>
54 : golsen 1.4 There are potentially unexpected, or even undesired, consequences of this, but
55 :     it is often the case that closer consideration suggests that they are mostly
56 : golsen 1.1 harmless, or perhaps even blessings in disguise. In particular, the reported
57 : golsen 1.4 region of similarity can have a negative start coordinate, because the
58 :     similarity does extend beyond the reported start of the particular protein.
59 :     It is also the case that similarity scores are not adjusted for shorter
60 :     sequences that might not include the entire region of similarity (again, this
61 :     is only the case when the reported start point is a negative sequence
62 :     position).
63 : golsen 1.3 </BLOCKQUOTE></Small>
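<p>
To see where a negative start coordinate comes from, here is a small sketch of the adjustment. It assumes 1-based coordinates and a simple subtraction of the number of residues the shorter sequence lacks at its amino terminus. The function name is hypothetical, and the source does not say how the SEED treats a result of exactly zero.

```python
def adjust_coordinates(rep_start, rep_end, offset):
    """Translate 1-based match coordinates on the representative sequence
    to a shorter equivalent sequence.

    offset: number of N-terminal residues present in the representative
    but absent from the shorter sequence (an assumed bookkeeping value).
    """
    return rep_start - offset, rep_end - offset


# The match starts inside the shared region: coordinates stay positive.
inside = adjust_coordinates(15, 120, 10)
# The match starts upstream of the shorter sequence's start: negative start.
upstream = adjust_coordinates(5, 120, 10)
```

The score attached to the match is left unchanged in both cases, which is the behavior the note above describes.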
64 : golsen 1.1
65 :     <H2>Standard Options</H2>
66 :    
67 : golsen 1.4 <b><font color=blue>Max sims</font></b> is the number of similarities to
68 :     report. This is the number of entries in the table of similarities, not
69 :     necessarily the number of unique sequences that were "expanded". The number
70 :     reported can be less than this if there are fewer entries in the database that
71 :     satisfy all of the other criteria defined by the search options (by default,
72 :     the only limit is the E-value of the match).
73 :    
74 :     <p>
75 :     <b><font color=blue>Max expand</font></b> lets the user control the expansion
76 :     of the representative database sequence (which in this context should be
77 :     viewed as arbitrarily chosen) into the list of equivalent sequences.
78 :     Expanding is essential for two functions: Showing just FIG sequences and
79 :     showing sequences that are identical to the query. Beyond these two cases,
80 :     expanding at least some sequences is frequently useful for seeing what diverse
81 :     databases (<i>e.g.</i>, UniProt, KEGG and RefSeq) have to say about the most
82 :     significant matches. Sometimes it appears that fewer entries were expanded
83 :     because one or more expanded entries were filtered out by later tests
84 :     (<i>e.g.</i>, they were environmental sequences, or they did not have a FIG
85 :     ID). Regardless of the value of <b><font color=blue>Max expand</font></b>,
86 :     the process stops as soon as <b><font color=blue>Max sims</font></b>
87 :     similarities have been reported.
88 :    
89 :     <p>
90 :     <b><font color=blue>Max E-val</font></b> sets an upper limit on the E-value
91 :     (the expected number of random database matches this good or better) of
92 :     matches that will be displayed (that is, a lower limit on the significance).
93 :     The highest E-value actually reported is never greater than this value, but
94 :     can be less due to the E-value cutoff used in computing the original
95 :     similarities, and/or the limited number of similarities stored in the
96 :     database.
97 : golsen 1.1
98 :     <p>
99 : golsen 1.4 There is a pop-up menu to select the treatment of entries when they are
100 :     expanded (and even to force the expanding of additional entries):
101 : golsen 1.2
102 :     <ul>
103 : golsen 1.3
104 : golsen 1.4 <li><b><font color=blue>Show all databases</font></b> displays information for
105 :     all equivalent sequences for the first <b><font color=blue>Max
106 :     expand</font></b> entries that satisfy all of the other criteria. For any
107 :     additional matches, up to a total of <b><font color=blue>Max sims</font></b>,
108 : golsen 1.3 just the information for the "representative" sequence is shown.<p>
109 : golsen 1.2
110 : golsen 1.4 <li><b><font color=blue>Prefer FIG IDs (to max exp)</font></b> displays all
111 :     FIG IDs for the first <b><font color=blue>Max expand</font></b> entries that
112 :     satisfy all of the other criteria. If a set of equivalent sequences does not
113 :     include any with a FIG ID, the information for the "representative" sequence
114 :     is shown. For any additional matches, up to a total of <b><font
115 :     color=blue>Max sims</font></b>, the information for the "representative"
116 :     sequence is shown.<p>
117 :    
118 :     <li><b><font color=blue>Prefer FIG IDs (all)</font></b> displays all available
119 :     FIG IDs for entries that satisfy all of the other criteria. If a set of
120 :     equivalent sequences does not include any with a FIG ID, the information for
121 :     the "representative" sequence is shown. This option ignores <b><font
122 :     color=blue>Max expand</font></b>, and hence may take more time.<p>
123 :    
124 :     <li><b><font color=blue>Just FIG IDs (to max exp)</font></b> filters the
125 :     expanded similarities for entries with FIG IDs (genomes in the SEED). This
126 :     only affects expanded entries, so if the number of FIG IDs found in <b><font
127 :     color=blue>Max expand</font></b> entries is less than <b><font color=blue>Max
128 :     sims</font></b>, and there are additional similarities that satisfy all other
129 :     criteria, the information for the "representative" sequence is shown.<p>
130 :    
131 :     <li><b><font color=blue>Just FIG IDs (all)</font></b> filters the expanded
132 :     similarities for entries with FIG IDs (genomes in the SEED). This option
133 :     ignores <b><font color=blue>Max expand</font></b>, and hence may take more
134 :     time.
135 : golsen 1.3
136 : golsen 1.2 </ul>
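<p>
The interaction of Max sims, Max expand, Max E-val, and the pop-up menu can be summarized with a rough sketch. Everything here is an assumption made for illustration: the record layout (<code>rep</code>, <code>members</code>, <code>evalue</code>), the mode names, the <code>expand_all</code> flag standing in for the "(all)" menu entries, and the test for a FIG ID by its <code>fig|</code> prefix. It is one reading of the descriptions above, not the SEED implementation.

```python
def is_fig(seq_id):
    # Assumption: FIG IDs are recognizable by a "fig|" prefix.
    return seq_id.startswith("fig|")


def similarity_rows(matches, max_sims, max_expand, mode="all",
                    expand_all=False, max_eval=1e-5):
    """Build the list of displayed IDs from matches sorted best first.

    mode: "all" (show all databases), "prefer_fig", or "just_fig".
    expand_all: True models the "(all)" menu entries, which ignore max_expand.
    """
    rows = []
    if max_sims <= 0:
        return rows
    expanded = 0
    for match in matches:
        if match["evalue"] > max_eval:
            continue
        if expand_all or expanded < max_expand:
            expanded += 1  # counted even if the filter leaves nothing to show
            if mode == "all":
                ids = list(match["members"])
            else:
                ids = [i for i in match["members"] if is_fig(i)]
                if not ids and mode == "prefer_fig":
                    ids = [match["rep"]]
        else:
            ids = [match["rep"]]  # unexpanded: representative only
        for seq_id in ids:
            rows.append(seq_id)
            if len(rows) >= max_sims:
                return rows  # stops regardless of max_expand
    return rows


matches = [
    {"rep": "gi|1", "members": ["gi|1", "fig|1.1.peg.1", "sp|A"], "evalue": 1e-50},
    {"rep": "gi|2", "members": ["gi|2", "sp|B"], "evalue": 1e-40},
    {"rep": "gi|3", "members": ["gi|3", "fig|2.1.peg.7"], "evalue": 1e-30},
]
```

With <code>mode="just_fig"</code> and <code>max_expand=2</code>, the second match is expanded but contributes no row, which is why it can look as if fewer entries were expanded than requested.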
137 : golsen 1.1
138 :     <p>
139 : golsen 1.4 <b><font color=blue>Show Env. samples</font></b> is used to enable the
140 :     reporting of similarities to environmental sequences. By default,
141 :     environmental sequences are not reported because all of their annotations are
142 :     indirect. However, they may be displayed so that the user can annotate them,
143 :     or explore other properties, such as genomic context.
144 : golsen 1.1
145 :     <p>
146 : golsen 1.4 <b><font color=blue>Hide aliases</font></b> removes the aliases column from
147 :     the similarities table, primarily
148 : golsen 1.1 to save screen real estate.
149 :    
150 :     <p>
151 : golsen 1.4 <b><font color=blue>Sort by</font></b>:
152 : golsen 1.2
153 :     <ul>
154 : golsen 1.3
155 : golsen 1.4 <li><b><font color=blue>score</font></b> is the default option. This is
156 :     essentially sorting by the significance of the match, but avoids the problem
157 :     that E-values all become 0 when there are highly similar, large proteins.
158 :     This is equivalent to the behavior of the BLAST programs.<p>
159 :    
160 :     <li><b><font color=blue>percent identity*</font></b> and <b><font
161 :     color=blue>percent identity</font></b> are intuitive ways to find the most
162 :     similar sequences when some of them might be partial. With the bit score
163 :     order, partial sequences are near the end of the similarities list, even when
164 :     they are nearly identical to the query sequence. This can be very important
165 :     for finding homologs in partial genome sequences.<p>
166 :    
167 :     <li>bit <b><font color=blue>score per position*</font></b> and bit <b><font
168 :     color=blue>score per position</font></b> are another way to find the most
169 :     similar sequences when some of them might be partial. As sequences become
170 :     diverged, the bit score better captures the residual similarity by giving
171 :     positive score to common amino acid replacements.<p>
172 :    
173 :     For the above two options, the difference in the <b><font
174 :     color=blue>*</font></b> version (relative to the non-* version) is a small
175 :     sample penalty so that very short sequences will be less apt to randomly
176 :     appear very early in the list. As the order in the menu might suggest, the
177 :     <b><font color=blue>*</font></b> versions are recommended, though details of
178 :     their behavior might be changed in the future.
179 : golsen 1.2
180 :     </ul>
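<p>
The effect of the different sort orders on partial sequences can be seen in a toy example. The field names are assumptions, bit score per position is taken to be bit score divided by alignment length, and the small sample penalty shown is a made-up placeholder, since this page does not give the formula the SEED uses.

```python
def bits_per_position(hit):
    # Assumption: "per position" means per aligned position.
    return hit["bits"] / hit["aln_len"]


def penalized(value, aln_len, pseudo=20):
    # HYPOTHETICAL small sample penalty: shrink the value for short
    # alignments. The real penalty used by the SEED is not documented here.
    return value * aln_len / (aln_len + pseudo)


def sort_hits(hits, key="score"):
    keys = {
        "score": lambda h: h["bits"],
        "identity": lambda h: h["identity"],
        "identity*": lambda h: penalized(h["identity"], h["aln_len"]),
        "bits_per_pos": bits_per_position,
        "bits_per_pos*": lambda h: penalized(bits_per_position(h), h["aln_len"]),
    }
    return sorted(hits, key=keys[key], reverse=True)


hits = [
    {"id": "full", "bits": 400.0, "identity": 60.0, "aln_len": 300},
    {"id": "partial", "bits": 190.0, "identity": 98.0, "aln_len": 100},
    {"id": "tiny", "bits": 25.0, "identity": 100.0, "aln_len": 12},
]


def order(key):
    return [h["id"] for h in sort_hits(hits, key)]
```

Sorting by score puts the nearly identical partial sequence after the full-length but more distant match, while the non-* orders let the 12-residue match jump to the top. Under the placeholder penalty, the * orders keep the partial sequence first without promoting the tiny one.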
181 :    
182 :     <p>
183 : golsen 1.4 <b><font color=blue>Group by genome</font></b> is a useful method to collect
184 :     paralogs within a genome to the same location in the similarities list.
185 :     There are several properties of the function that need to be understood to
186 :     use it safely and effectively:
187 : golsen 1.1
188 : golsen 1.2 <ul>
189 : golsen 1.1
190 : golsen 1.4 <li>The same set of similarities is displayed whether this option is
191 :     selected or not. That is, no special attempt is made to find and include
192 :     less significant matches in a genome than would otherwise be displayed (most
193 :     commonly limited by <b><font color=blue>Max sims</font></b>, or by Maximum
194 :     E-value). To maximize your chance of seeing paralogues, make <b><font
195 :     color=blue>Max sims</font></b> a very large number, such as 500 or 1000.<p>
196 : golsen 1.1
197 :     <li>The ability to identify the genome requires the FIG ID, so it is
198 : golsen 1.4 necessary to expand the entries. Again, this means a large value of <b><font
199 :     color=blue>Max expand</font></b>.<p>
200 : golsen 1.1
201 : golsen 1.4 <li>When expanding a large number of entries, you might want to use <b><font
202 :     color=blue>Just FIG IDs</font></b>.<p>
203 : golsen 1.1
204 : golsen 1.4 <li>Even with all these options, in large gene families (<i>e.g.</i>,
205 :     Translation elongation factor Tu), the limits on the number of similarities
206 :     in the database can limit finding of paralogues.<p>
207 :    
208 :     <li><b><font color=red>Be very aware</font></b> that less significant matches
209 :     are now intermixed with more significant matches. <b><font color=red>Do
210 :     not</font></b> mechanically run down the list and assign all of the matches
211 :     at the top of the list the same function!<p>
212 : golsen 1.1
213 :     <li>On the other hand, this is an excellent mode for selecting the top entries
214 : golsen 1.3 and making a tree that shows the paralog families!
215 :    
216 : golsen 1.2 </ul>
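<p>
A minimal sketch of grouping by genome follows. It assumes FIG IDs of the form <code>fig|&lt;genome&gt;.peg.&lt;number&gt;</code>, and it assumes that each genome's matches are gathered at the position of that genome's best match; the page does not state the exact placement rule, so treat this as one plausible reading.

```python
import re

# Assumed FIG ID layout: fig|<taxon>.<version>.peg.<number>
FIG_ID = re.compile(r"^fig\|(\d+\.\d+)\.peg\.\d+$")


def genome_of(seq_id):
    """Return the genome part of a FIG ID, or None for any other ID."""
    m = FIG_ID.match(seq_id)
    return m.group(1) if m else None


def group_by_genome(ids):
    """Reorder IDs (given best match first) so that all matches from one
    genome sit together, at the position of that genome's best match."""
    groups = {}  # dicts preserve insertion order
    for seq_id in ids:
        genome = genome_of(seq_id)
        # IDs without a genome cannot be grouped and stay on their own.
        key = genome if genome is not None else ("id", seq_id)
        groups.setdefault(key, []).append(seq_id)
    return [seq_id for members in groups.values() for seq_id in members]


ranked = [
    "fig|83333.1.peg.10",
    "fig|224308.1.peg.5",
    "sp|P0A6N1",
    "fig|83333.1.peg.200",
    "fig|224308.1.peg.77",
]
grouped = group_by_genome(ranked)
```

Here the fourth-ranked match moves up to second place, ahead of a better match from another genome, which is exactly why the list should not be read as a strict ranking in this mode.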
217 : golsen 1.1
218 :     <H2>Standard Buttons</H2>
219 :    
220 : golsen 1.4 The <b><font color=blue>resubmit</font></b> button is used to redisplay the
221 :     similarities after changing one or more of the options. This is also the
222 :     action taken if you press the return key while the cursor is in one of the
223 :     text boxes.
224 : golsen 1.1
225 :     <p>
226 : golsen 1.4 The <b><font color=blue>more similarities</font></b> button acts the same as
227 :     the resubmit button, but it also doubles the values of Max sims and Max
228 :     expand.
229 : golsen 1.1
230 :     <p>
231 : golsen 1.4 The <b><font color=blue>previous PEG</font></b> button (when present)
232 :     navigates to the next lower numbered protein coding gene in the genome. This
233 :     is not necessarily in the same contig, and in some genomes, it can be
234 :     anywhere. This operation conserves all of the similarities settings (unlike
235 :     navigating via the genome context table, or map).
236 : golsen 1.1
237 :     <p>
238 : golsen 1.4 The <b><font color=blue>next PEG</font></b> button (when present) navigates
239 :     to the next higher numbered protein coding gene in the genome. This is not
240 :     necessarily in the same contig, and in some genomes, it can be anywhere.
241 :     This operation conserves all of the similarities settings (unlike navigating
242 :     via the genome context table, or map).
243 : golsen 1.1
244 :     <p>
245 : golsen 1.4 The <b><font color=blue>more sim options</font></b> button does what it says:
246 :     it gives you more options.
247 : golsen 1.1
248 :     <H2>Extra Options</H2>
249 :    
250 : golsen 1.4 The <b><font color=blue>more sim options</font></b> button enables additional
251 :     options, with their default values. Hiding the options with the <b><font
252 :     color=blue>fewer sim options</font></b> button reverts all of the extra
253 :     options to their default values. That is, the display is never influenced by
254 :     options that are hidden from view.
255 :    
256 :     <p>
257 :     <b><font color=blue>Min similarity</font></b> defines another way to cut off
258 :     weak matches, in this case by percent <b><font
259 :     color=blue>identity</font></b> (as reported by BLAST, which comes with
260 :     caveats) or by bit <b><font color=blue>score per position</font></b>.
261 :     Although the latter is less intuitive than percent identity, it is probably
262 :     better for highly diverged sequences in that the most common amino acid
263 :     replacements still get a positive score. The effective range of this measure
264 :     is 0 to 2 bits. The measurement option is selected with the <b><font
265 :     color=blue>as defined by</font></b> pop-up menu.
266 :    
267 :     <p>
268 :     <b><font color=blue>Min query cover (%)</font></b> is used to eliminate
269 :     matches that only cover a small part of the query. Typically these are
270 :     matches to conserved domains, but they can also be matches to fragmentary
271 :     genes in the database.
272 :    
273 :     <p>
274 :     <b><font color=blue>Min subject cover (%)</font></b> is used to eliminate
275 :     matches that only cover a small part of the database sequence. Typically
276 :     these are matches to conserved domains, but they can also be matches to
277 :     multifunctional genes in the database.
278 : golsen 1.1
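<p>
The coverage filters can be illustrated with a short sketch. It assumes 1-based inclusive match coordinates and computes coverage as the matched length divided by the full sequence length; the field names and the function names are hypothetical.

```python
def percent_cover(start, end, length):
    """Percent of a sequence covered by a match with 1-based inclusive ends."""
    return 100.0 * (end - start + 1) / length


def passes_cover(hit, min_query_cover=0.0, min_subject_cover=0.0):
    query_cover = percent_cover(hit["q_start"], hit["q_end"], hit["q_len"])
    subject_cover = percent_cover(hit["s_start"], hit["s_end"], hit["s_len"])
    return query_cover >= min_query_cover and subject_cover >= min_subject_cover


# A fragmentary database gene: all of the subject matches, but it covers
# only a quarter of the query.
fragment = {"q_start": 1, "q_end": 100, "q_len": 400,
            "s_start": 1, "s_end": 100, "s_len": 100}

# A multifunctional database gene: the whole query matches one domain of a
# much longer subject.
multidomain = {"q_start": 1, "q_end": 200, "q_len": 200,
               "s_start": 301, "s_end": 500, "s_len": 1000}
```

A minimum query cover of 50% removes the fragment but keeps the multifunctional gene, and a minimum subject cover of 50% does the opposite.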
279 :     </BODY>
280 :     </HTML>
