[Bio] / FigTutorial / DataMining.html Repository:
ViewVC logotype

Annotation of /FigTutorial/DataMining.html

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1.4 - (view) (download) (as text)

1 : olson 1.1 <h1>Data Mining Tools in the SEED</h1>
2 : overbeek 1.3 <h2>A Sketchy Set of Notes Maintained by Ross Overbeek</h2>
3 :     <p>
4 :     This tutorial discusses a number of issues that you will need to know about
5 :     in order to use the data mining tools included in SEED. It is organized as follows:
6 :     </p>
7 :    
8 :     <ol>
9 :     <li><A HREF="#Introduction: pegs, fid2dna, translate, and translation_of">Introduction: pegs, fid2dna, translate, and translation_of</A>
10 :     <li><A HREF="#A More Complex Example: Indirect Functional Coupling">A More Complex Example: Indirect Functional Coupling</A>
11 :     <li><A HREF="#Extracting Upstream Regions">Extracting Upstream Regions</A>
12 :     <li><A HREF="#Locating New Forms of Enzymes">Locating New Forms of Enzymes</A>
13 :     </ol>
14 :     The first time through, I recommend reading it in order.
15 :    
16 :    
17 :     <h2 id="Introduction: pegs, fid2dna, translate, and translation_of" >Introduction: pegs, fid2dna, translate, and translation_of</h2>
18 : olson 1.1 The SEED is a loosely structured, evolving system. It is by intent an
19 :     "executable prototype" in which new ideas are installed by numerous
20 :     participants. The set of tools that I describe here are descendants
21 :     of an experiment that we did with the WIT system. The basic idea is
22 :     to build a command-line capability to very easily extract data from an
23 :     integration like the SEED. To extract desired data, you invoke small
24 :     programs that are strung together to produce the desired output. It
25 :     is clearly a notion that derived from the original ideas developed in
26 :     the UNIX culture. Although we will not elaborate on the point here,
27 :     most of the precise design grew out of work done on logic programming
28 :     systems in the late 1980's and early 1990's.
29 :     <p>
30 :     To see how things work, let's consider a simple example:
31 :     <pre>
32 :     pegs 83333.1 | fid2dna | translate
33 :     </pre>
34 :     is a sequence of three commands that are strung together to get
35 :     translations of the genes in <i>Escherichia coli</i>. The way it
36 :     works is that
37 :     <ul>
38 :     <li><b>pegs</b> is a command that takes as arguments a list of genome
39 :     IDs (in this case we used only one - 83333.1). It writes out a stream
40 :     of the IDs of the PEGs (protein-encoding genes, often called CDSs) for
41 :     the set of genomes. You can see this by just typing
42 :     <pre>
43 :     pegs 83333.1 | head -5
44 :     </pre>
45 :     which will show the first 5 lines written by <b>pegs</b>.
46 :     <li> <b>fid2dna</b> is a utility that takes as input a stream of
47 :     feature IDs and outputs a FASTA formatted file of the DNA for the
48 :     features.
49 :     You might try
50 :     <pre>
51 :     pegs 83333.1 | fid2dna | head -50
52 :     </pre>
53 :     to see what comes out.
54 :     <li>Finally, <b>translate</b> is a utility that translates a file of
55 :     DNA sequences (in FASTA format) into a file of protein sequences in
56 :     FASTA format.
57 :     </ul>
58 :     This little example actually translates the DNA. The SEED includes
59 :     translations for each PEG. You could get them using
60 :     <pre>
61 :     pegs 83333.1 | translation_of
62 :     </pre>
63 :     As an exercise, we suggest that you run
64 :     <pre>
65 :     pegs 83333.1 | translation_of > tmp.SEED.translations
66 :     pegs 83333.1 | fid2dna | translate > tmp.translated.SEED.DNA
67 :     </pre>
68 :     and see how the two files compare. Would you expect them to be the
69 :     same? Why might they differ?
70 :    
71 : overbeek 1.3 <h2 id="A More Complex Example: Indirect Functional Coupling">A More Complex Example: Indirect Functional Coupling </h2>
72 : olson 1.1
73 :     Functionally related genes tend to cluster on prokaryotic chromosomes.
74 :     This fact is the basis for a number of techniques that have been used
75 :     to gain clues relating to the function of "hypothetical proteins". As
76 :     the number of sequenced genomes grows, the power of techniques based
77 :     on statistical analysis of chromosomal clusters grows dramatically (I
78 :     have argued elsewhere that the information grows approximately as the
79 :     square of the number of genomes, but that is a story best left for
80 :     another document).
81 :     <p>
82 :     Inevitably, the question arises: <b>"What can one do with a gene when it
83 :     does not appear to be in a chromosomal cluster?"</b> The technique
84 :     that I will describe now has often been referred to as <i>indirect
85 :     coupling</i>.
86 :     Here is how it works:
87 :     <ol>
88 :     <li>
89 :     You start with a gene in some genome (prokaryotic or eukaryotic).
90 :     Call this gene <i>X</i>.
91 :     <li>
92 :     Compute all bi-directional best hits (BBHs) of <i>X</i>. These amount to
93 :     an attempt to find the corresponding genes in other genomes. You need
94 :     to remember that such projections are not completely reliable, but
95 :     they are at the heart of many of the inferences we make relating to
96 :     function.
97 :     Call this set of BBHs <i>S1</i>.
98 :     <li>
99 :     Now for each element of S1, we can check to see if it appears to be in
100 :     a chromosomal cluster, and if it is we can compute the genes that
101 :     appear to be <i>directly functional coupled</i>. Call this set of
102 :     genes that are fucntionally coupled to BBHs of <i>X</i> <i>S2</i>.
103 :     <li>
104 :     Now we want to locate the genes in the original genome that correspond
105 :     to genes in </S2>. Again, we estimate this correspondence using BBHs.
106 :     That is, we take BBHs of genes in <i>S2</i> that are in the same
107 :     genome as <i>X</i>. Call this set of genes <i>S3</i>. These are the
108 :     genes that are indirectly coupled to <i>X</i>.
109 :     <li>
110 :     Finally, look at the functions of the indirectly coupled genes. This
111 :     set of functions can often be hugely informative.
112 :     </ol>
113 :    
114 :     To illustrate this process and how it can be implemented using the
115 :     tools we provide with the SEED, let us go through the computation one
116 :     step at a time.
117 :    
118 :     The first step is to pick an initial gene. As an example, let me
119 :     choose a gene from <i>Methanocaldococcus jannaschii</i>, one of the first
120 :     genomes I studied. In the version of the SEED I am using at this
121 :     time, the peg is <i>fig|2190.1.peg.1537</i>. It is also known by a
122 :     number of other ids: <i>NP_248444.1</i>, <i>gi|15669631</i>,
123 :     <i>sp|Q58835</i>, and kegg|mja:MJ1440</i>, for example. You should
124 :     make a file containing just one line, and that line should contain one
125 :     of these ids. Call this file <i>initial.peg</i>.
126 :     Then
127 :     <pre>
128 :     bbhs < initial.peg > corresponding.genes.in.other.genomes
129 :     </pre>
130 :     should construct a file containing the BBHs of the single gene in the
131 :     file. If <i>initial.peg</i> had contained more genes, BBHs would have
132 :     been computed for all of the ids in the file.
133 :     <p>
134 :     You should make an initial file and run <b>bbhs</b> to make sure that
135 :     you understand this simple tool. If you wished to restrict yourself
136 :     to BBHs that had better similarity scores than, say, 1.0e-50, you
137 :     would use
138 :     <pre>
139 :     bbhs 1.0e-50 < initial.peg > corresponding.genes.in.other.genomes
140 :     </pre>
141 :     The tool <i>bbhs</i> computes BBHs for any id that occurs as a last
142 :     field in the tab-separated fields in lines from the input file. If there are numerous
143 :     PEGs in the input file, execution of <i>bbhs</i> can go for hours.
144 :     <p>
145 :     The next step involved computing genes that were functionally coupled
146 :     to those in <i>corresponding.genes.in.other.genomes</i>. This can be
147 :     done using
148 :    
149 :     <pre>
150 :     functionally_coupled 5000 1.0e-10 3 fast < corresponding.genes.in.other.genomes > coupled.in.other.genomes
151 :     </pre>
152 :     The tool <i>functionally_coupled</i> computes genes that are
153 :     functionally coupled (based on preserved contiguity) to those that
154 :     occur as last entries in the tab-separated fields of lines from the
155 :     input file. The parameters to the tool are as follows:
156 :     <ol>
157 :     <li>the maximum distance separating a gene from another that is
158 :     putatively clustered with it,
159 :     <li>
160 :     the maximum p-score for similarities used to find corresponding genes
161 :     between two clusters,
162 :     <li>
163 :     the minimum functional coupling score (which is one less than the number of
164 :     clusters in "significantly different" genomes), and
165 :     <li>
166 :     the word <i>fast</i> which causes the tool to attempt a fast technique
167 :     that sometimes misses couplings (this argument is optional). <i>Fast</i> is meant only relatively here:
168 :     execution can go for hours, even with the option set.
169 :     </ol>
170 :     The tool writes out 3-tuples containing the two functionally-coupled
171 :     genes separated by their coupling score.
172 :     <p>
173 :     Again, I urge you to actually go through this step and examine the
174 :     output file.
175 :     <p>
176 :     Once you have created the file <i>coupled.in.other.genomes</i>, it is
177 :     time to search for BBHs from the coupled genes back in the original
178 :     genome. This can be done using
179 :     <pre>
180 :     bbhs < coupled.in.other.genomes | restrict_to_genome initial.genome > restricted.to.initial.genome
181 :     </pre>
182 :     The <i>bbhs</i> tool is the same one we discussed above. The simple
183 :     tool <i>restrict_to_genome</i> takes a single argument which is a file
184 :     containing a list of genome ids. In this case the file contains just
185 :     a single line giving the id of the genome containing the initial gene
186 :     (which was, if you recall, 2190.1).
187 :     <p>
188 :     Finally, you should append the functions of the detected genes and
189 :     remove duplicates using
190 :     <pre>
191 :     function_of < restricted.to.initial.genome | sort -u > indirectly.coupled.in.original.genome
192 :     </pre>
193 :     The <i>function_of</i> tool takes the last field in each input line,
194 :     looks up the function (assuming the field was a PEG id), and writes
195 :     out the PEG id and function.
196 :     If you do all of this, you get a file containing
197 :     <pre>
198 :     fig|2190.1.peg.1170 Methanocaldococcus jannaschii Shikimate 5-dehydrogenase (EC 1.1.1.25)
199 :     fig|2190.1.peg.1551 Methanocaldococcus jannaschii 3-dehydroquinate dehydratase (EC 4.2.1.10)
200 :     fig|2190.1.peg.571 Methanocaldococcus jannaschii 3-phosphoshikimate 1-carboxyvinyltransferase (EC 2.5.1.19)
201 :     </pre>
202 :     For those of you who are familiar with aromatic amino acid
203 :     biosynthesis, these three functions do, in fact, perfectly bracket the
204 :     initial gene (fig|2190.1.peg.1537, which is the archael shikimate
205 :     kinase) in the pathway for chorismate biosynthesis. A number of us
206 :     partiucipated in predicting the function of this gene a number of
207 :     years ago. With the tools and data we have now, the predictions of function
208 :     for such genes is vastly simplified.
209 :    
210 :     You could have produced the three lines of output by stringing all of the
211 :     commands together using
212 :     <pre>
213 :     % bbhs < initial.peg |
214 :     > functionally_coupled 5000 1.0e-10 3 fast |
215 :     > bbhs |
216 :     > restrict_to_genomes initial.genome |
217 :     > function_of |
218 :     > sort -u > indirectly.coupled.in.original.genome
219 :     </pre>
220 :     So, there you have a pipeline of commands to compute indirect functional
221 :     coupling. You can adjust the similarities required for the BBHs, and
222 :     you can alter the minimum coupling score required, but that is
223 :     essentially the computation.
224 :     <p>
225 :     You might wish to play with the different parameters. In particular,
226 :     the "fast" parameter on <i>functionally_coupled</i> is optional, and
227 :     does alter the behaviour. How well it performs depends critically on
228 :     how many <i>pins</i> and <i>chromosomal clusters</i> have accurately
229 :     been precomputed. If you do not use the "fast" parameter, it can slow
230 :     things down by an order of magnitude. We do try to keep pins and
231 :     clusters in good shape, but ....
232 :     <p>
233 :     Finally, you could use <i>similar_to</i> rather than <i>bbhs</i>.
234 :     However, unless you proceed incrementally, checking all intermediate
235 :     results, this can produce a lot of false positives (i.e., a lot of
236 :     garbage).
237 :    
238 : overbeek 1.3 <h2 id="Extracting Upstream Regions">Extracting Upstream Regions</h2>
239 : overbeek 1.2
240 :     Suppose that you wished to search upstream regions of genes for
241 :     control sites. To do this, you need to get sets of genes that might
242 :     share a common control site, extract the upstream regions, and then
243 :     search these for a common regulatory site. In this section we will
244 :     discuss how to extract the desired upstream regions.
245 :     <p>
246 :     Normally, one would select a set of potentially co-regulated genes.
247 :     This might be done using something like
248 :     <pre>
249 :     % pegs_in_subsystem "Histidine Biosynthesis" | grep 'fig|211586.1.peg' > related
250 :     % cat related
251 :     ATP phosphoribosyltransferase (EC 2.4.2.17) fig|211586.1.peg.1886
252 :     Phosphoribosyl-ATP pyrophosphatase (EC 3.6.1.31) fig|211586.1.peg.1879
253 :     Phosphoribosyl-AMP cyclohydrolase (EC 3.5.4.19) fig|211586.1.peg.1879
254 :     Phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase (EC 5.3.1.16) fig|211586.1.peg.1881
255 :     Phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase (EC 5.3.1.16) fig|211586.1.peg.3771
256 :     Imidazole glycerol phosphate synthase subunit hisH (EC 2.4.2.-) fig|211586.1.peg.1882
257 :     Imidazole glycerol phosphate synthase subunit hisF (EC 4.1.3.-) fig|211586.1.peg.1880
258 :     Imidazoleglycerol-phosphate dehydratase (EC 4.2.1.19) fig|211586.1.peg.1883
259 :     Histidinol-phosphate aminotransferase (EC 2.6.1.9) fig|211586.1.peg.1884
260 :     Histidinol-phosphatase (EC 3.1.3.15) fig|211586.1.peg.1883
261 :     Histidinol dehydrogenase (EC 1.1.1.23) fig|211586.1.peg.1885
262 :     %
263 :     </pre>
264 :     <br>
265 :     The command <i>pegs_in_subsystem</i> can be used to extract all of the
266 :     genes from the spreadsheet encoding the designated subsystem (in this case,
267 : overbeek 1.4 <i>Histidine Biosynthesis</i>). The command
268 : overbeek 1.2 produces a 2-column (tab-separated) file in which the
269 :     first column is a functional role and the second is a PEG assigned
270 :     that function. By using <i>grep</i> to pull out just those genes for
271 :     a given genome (in this case 211586.1, which is <i>Shewanella
272 :     oneidensis MR-1</i>), we get a potentially co-regulated set of genes.
273 :     This may or may not be exactly what you wish -- you must still be
274 :     careful of situations in which the genes occur in a single operon, but
275 :     that is a detail for later.
276 :     <p>
277 : overbeek 1.3 Once you have a set of potentially co-regulated genes, you might use
278 : overbeek 1.2 something like
279 :     <pre>
280 :     upstream upstream=200 plus=100 < related > upstream.fasta
281 :     </pre>
282 :     <br>
283 :     which produces a file like
284 :     <pre>
285 :     >fig|211586.1.peg.1886
286 :     acccaccatcatcaccaccatatgcggtagccttcggacgttcactgccggaggactata
287 :     tccttctggcggactgattcaaagatgatcataaacccctggaaggaattccgggggttt
288 :     ttgcttttaggcttttaataaacgatttaacagctaattagtaattaatttcaatatgtt
289 :     aaccgccacatttactggcg-GTGCCGATAAGCCAGTTAAAGCGTGAAACGAAAGAATAC
290 :     CATTTTGAATTTGTTAACTTTGAGCGAGATAGAATTATGACCGAATCAAACCGTTTACGT
291 :     A
292 :     >fig|211586.1.peg.1879
293 :     GTTTATCGATGCGAAGGTGGATGCAGCACTGGCGGCCAGCGTGTTCCATAAAGCCATTAT
294 :     CAATATCGGCGAGTTAAAACAGTATTTAGCCGCCGAAGGTATTGCTATTAGACGCtagag
295 :     cggttctaggttctaggttctaggttctaggttctaggttctaggttctaggttctacaa
296 :     atagattaagagtgaatact-ATGTCAACGCAAACAAACACTAAGTCAGATTTTAGCGAG
297 :     CTGGATTGGGACAAACAGGATGGGCTTATCCCTGCTGTTATCCAAAACCACTTATCAGGC
298 :     A
299 :     >fig|211586.1.peg.1881
300 :     ATCGGTAAAGACAACTTTATGGGCGTGCAATTCCACCCTGAAAAAAGTGCCGCCGTGGGC
301 :     GCGCAAATCCTAGGTAACTTTTTAAAAATGCAAtaaatgcaatcgcttattaggtttggc
302 :     tgagccaccactatttttcagccacaccatacgacaataaagcgcggcagcaaatgcacc
303 :     aaaagaattaaggacaacag-ATGATCATTCCAGCGATTGATTTAATTGATGGCAAGGTA
304 :     GTCAGGCTCTACCAAGGGGATTATGGGCAGCAGACCACCTTCGATTTAAGCCCCCTCGCC
305 :     C
306 :     .
307 :     .
308 :     .
309 :     </pre>
310 :     Here, the parameters to upstream have the following meanings:
311 :     <ul>
312 :     <li>
313 :     <b>upstream=200</b> specifies that the output is to contain 200 bp of
314 :     upstream region, if possible (i.e., if the gene is not at the
315 :     end of a contig), and
316 :     <li>
317 :     <b>plus=100</b> specifies that you wish to include 100 bp at the start
318 :     of the gene.
319 :     </ul>
320 :     The output contains one fasta entry for each gene. A "-" separates the upstream sequence
321 :     from the start of the gene's coding sequence. Lowercase is used for the
322 :     upstream sequence, until it overlaps another gene (and from there on you get uppercase).
323 :     <p>
324 :     The key to actually extracting accurate upstream operators in prokaryotes has turned out
325 :     to be the use of comparative analysis with corresponding genes from closely related organisms.
326 :     This requires determining a set of "corresponding genes from closely related organisms" for each gene in the
327 :     potentially co-regulated set.
328 :     One way to do this might be to use
329 :     <pre>
330 : overbeek 1.3 pegs_in_subsystem "Histidine Biosynthesis" | grep "ATP phosphoribosyltransferase (EC 2.4.2.17)"
331 : overbeek 1.2 </pre>
332 :     <br>
333 :     which produces a file something like
334 :     <br>
335 :     <pre>
336 :     ATP phosphoribosyltransferase (EC 2.4.2.17) fig|83333.1.peg.1994
337 :     ATP phosphoribosyltransferase (EC 2.4.2.17) fig|272626.1.peg.573
338 :     ATP phosphoribosyltransferase (EC 2.4.2.17) fig|224308.1.peg.3498
339 :     ATP phosphoribosyltransferase (EC 2.4.2.17) fig|169963.1.peg.563
340 :     .
341 :     .
342 :     .
343 :     </pre>
344 :     <br>
345 :     Note that all we have done here is to extract a column (rather than a row) from the
346 :     subsystem spreadsheet.
347 :     <p>
348 :     You might, of course, wish to consider corresponding genes as defined by sequence similarity,
349 :     rather than using a subsystem spreadsheet. In this case, you need to get all of the genes
350 :     from other organisms that pass some similarity threshhold. To do this, you might use something like
351 :     <br>
352 :     <pre>
353 :     echo 'fig|211586.1.peg.1886' | similar_to > genes.similar.to.1886
354 :     </pre>
355 :     <br>
356 :     This gives you the genes similar to <i>fig|211586.1.peg.1886</i>. If you wished to get
357 :     the upstream regions for the original gene and its related genes, then
358 :     <pre>
359 :     (echo 'fig|211586.1.peg.1886'; echo 'fig|211586.1.peg.1886' | similar_to) | upstream
360 :     </pre>
361 :     would work (here we just use default parameters on <i>similar_to</i> and <i>upstream</i>).
362 :     To do this for an entire <i>Shewanella</i> genome I used
363 :     <pre>
364 :     % for i in `pegs 211586.1`
365 :     > do
366 : overbeek 1.3 > (echo "$i"; echo "$i" | similar_to | is_prokaryotic) | upstream upstream=500 plus=100 > "Output/$i"
367 :     > echo $i
368 : overbeek 1.2 > done
369 :     </pre>
370 :     while running in <i>bash</i>. The <i>is_prokaryotic</i> filter just removes non-prokaryotic genes from the
371 : overbeek 1.3 input to <i>upstream</i>. The <ii>is_eukaryotic</i>, <i>is_bacterial</i>, and <i>is_archaeal</i> filters now
372 :     work, as well.
373 : overbeek 1.2
374 : overbeek 1.3 <h2 id="Locating New Forms of Enzymes">Locating New Forms of Enzymes </h2>
375 : olson 1.1
376 :     Researchers have discovered that many enzymes have multiple,
377 :     nonhomologous forms. The best way to recognize such situations is by
378 :     constructing a <i>subsystem spreadsheet</i>. Once you have
379 :     constructed a spreadsheet for a subsystem, you often will find a set
380 :     of organisms in which the majority of functional roles can be matched
381 :     up with PEGs, but one or more columns remain empty. These often
382 :     represent cases in which an alternative form of the enzyme exists, and
383 :     one of the more interesting tasks in annotation is to search for the gene that
384 :     has not yet been identified as playing the missing role. In this
385 :     short section, we will illustrate one one to seek the missing gene
386 :     using the SEED data mining tools (you could, of course, also seek it
387 :     using the Web-based tools).
388 :     <p>
389 :     I will illustrate the technique using a search for the archaeal form
390 :     of the shikimate kinase. This example induces pleasant memories for
391 :     me, since this was one of the first bioinformatics predictions that I
392 :     was involved in that was then confirmed in the wet lab. At the time a
393 :     number of us worked out the prediction, our approachs were far from
394 :     systematic. Now it is time to formulate the approaches that
395 :     work and apply them routinely.
396 :     <p>
397 :     In this example, we had a known
398 :     version of the shikimate kinase, an enzyme used in aromatic
399 :     amino-acid biosynthesis. The known version occurs throughout the
400 :     bacteria and eukaryotes. However, while it was absolutely clear that
401 :     many archaea synthesize aromatic amino acids with essentially the same
402 :     pathway used by bacteria, locating the archaeal form of the shikimate
403 :     kinase had not yet been done. So, how might one go about locating the
404 :     archaeal form, given only a number of occurrences of the known
405 :     bacterial form?
406 :     <br>
407 :     The basic approach that we will illustrate here uses the following
408 :     steps:
409 :     <ul>
410 :     <li><b>Step 1</b>: You begin the approach using the known version of the
411 :     enzyme (any of the known versions). We will take a version from
412 :     <i>Brucella abortus</i>, fig|235.1.peg.189, but any of the bacterial
413 :     occurrences would work. We will call this the <i>initial
414 :     sequence</i>. We will begin by computing bi-directional best hits
415 :     (BBHs) of the initial sequence. Call the set composed of the initial
416 :     sequence and all of its BBHs, <b>Si: the set of known versions of the
417 :     enzyme</b>. Here is how we propose computing <b>Si</b>:
418 :     <pre>
419 :     bbhs iterate < initial > bbhs
420 :     cat initial bbhs > Si
421 :     </pre>
422 :     The command <i>bbhs</i> was used in our last example. In this case,
423 :     we have added an extra argument; by saying <i>iterate</i> we have
424 :     requested that the command compute BBHs of the PEGs in the file
425 :     initial, but then iterate by adding BBHs of any of the newly added
426 :     PEGs (and continue until no more new BBHs can be added).
427 :    
428 :     <li><b>Step 2</b>: Now we compute the set of genes that are
429 :     functionally coupled to the known version using
430 :    
431 :     <pre>
432 :     functionally_coupled 5000 1.0e-10 5 fast < Si > fc.Si
433 :     bbhs length=0.8 iterate < fc.Si > known.machinery
434 :     </pre>
435 :    
436 :     The genes in <b>fc.Si</b> should all occur in genomes that contain the
437 :     known version of the enzyme. This set represents "genes coupled to
438 :     the known version" and represents an estimate of the machinery that is
439 :     functionally related to the role played by the enzyme. If we then
440 :     project this machinery to all of the genomes (including those that do
441 :     not contain the known version) using <i>bbhs</i> we have the file <i>known.machinery</i>.
442 :    
443 :     <li><b>Step 3</b>: We then proceed by restricting the PEGs in <i>known.machinery</i> to those from
444 :     genomes that do not contain members of <i>Si</i>; that is, to those that might be from
445 :     genomes containing the alternative form. To do this we use
446 :     <pre>
447 :     genomes_of < Si > genomes.of.Si
448 :     restrict_to_genomes -v genomes.of.Si < known.machinery > known.in.genomes.without.orig.form
449 :     </pre>
450 :     The tool <i>genome_of</i> takes as input a file containing lines that end with PEG ids. It
451 :     outputs lines that contain the genome ids of all genomes containing those PEGs (i.e., those that occur
452 :     in the last column of the input file).
453 :     The tool <i>restrict_to_genomes</i> takes as input a file containing PEG ids in the
454 :     last column. It restricts this input set based on the genome ids of these PEGs, writing out those lines
455 :     that pass a test. The test is given by the command line arguments. If the command line contains a single
456 :     argument, it should be a file containing genome ids. In the case, the test amounts to "keep PEGs from
457 :     genomes corresponding to those in the given file". If the command line arguments are of the form
458 :     <i>-v File</>, then the test amounts to "keep only PEG ids from genomes other than those in the given file".
459 :    
460 :     <li><b>step 4</b>: Now that we have a file on known machinery in genomes that probably do not have the original
461 :     form of the enzyme, we use
462 :     <pre>
463 :     functionally_coupled 5000 1.0e-10 5 fast < known.in.genomes.without.orig.form > potential.candidates
464 :     </pre>
465 :     to get genes that are functionally coupled to PEGs in <i>known.in.genomes.without.orig.form</i>. The trouble
466 :     with these "potential candidates" is that the file will contain instances of all components of the
467 :     machinery of the subsystem; what we really want is instances that bear no resemblance to either the original
468 :     form or to any other gene that was functionally related to the original form (or to any BBH of those!).
469 :     We achive this restriction using
470 :     <pre>
471 :     cat Si fc.Si known.machinery > things.to.ignore
472 :     restrict_by_similarity -v things.to.ignore 1.0e-3 < potential.candidates > candidates
473 :     </pre>
474 :     The <i>restrict_by_similarity</i> command can be used to restrict a set of PEGs to those that are
475 :     either similar to, or definitely not similar to, a set of given PEGs (in this case <i>things.to.ignore</i>).
476 :     In the invocation shown, we take those not similar to genes in <i>potential.candidates</i>, and we specify a similarity threshhold of 1.0e-3.
477 :     <li><b>Step 5</b>: Finally, we take the set of candidate pegs and cluster all that are BBHs of one another.
478 :     This gives us a sequence of clusters of BBHs, where all of the entries in each cluster were dredged up by
479 :     this fairly simple little procedure. The command to produce the clusters is as follows:
480 :     <pre>
481 :     cluster_by_bbhs < candidates > data.to.look.at
482 :     </pre>
483 :     This concludes my initial discussion of what I call the "sniffer algorithm" (with the image of a bloodhound in mind).
484 :     It is, perhaps, worth noting that we base this exploration almost entirely on projection (BBHs) and functional coupling on the chromosome. If one wanted to include use of other forms of evidence of functional coupling, it can
485 :     be done easily. For example, to add fusion data, one would simply change
486 :    
487 :     <pre>
488 :     functionally_coupled 5000 1.0e-10 5 fast < known.in.genomes.without.orig.form > potential.candidates
489 :     </pre>
490 :    
491 :     to
492 :     <pre>
493 :     functionally_coupled 5000 1.0e-10 5 fast < known.in.genomes.without.orig.form > potential.candidates
494 :     fusion_of < known.in.genomes.without.orig.form >> potential.candidates
495 :     </pre>

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3