[Bio] / FigTutorial / DataMining.html Repository:
ViewVC logotype

Annotation of /FigTutorial/DataMining.html

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1.1 - (view) (download) (as text)

1 : olson 1.1 <h1>Data Mining Tools in the SEED</h1>
2 :     <h2>Introduction: pegs, fid2dna, translate, and translation_of</h2>
3 :     The SEED is a loosely structured, evolving system. It is by intent an
4 :     "executable prototype" in which new ideas are installed by numerous
5 :     participants. The set of tools that I describe here are descendants
6 :     of an experiment that we did with the WIT system. The basic idea is
7 :     to build a command-line capability to very easily extract data from an
8 :     integration like the SEED. To extract desired data, you invoke small
9 :     programs that are strung together to produce the desired output. It
10 :     is clearly a notion that derived from the original ideas developed in
11 :     the UNIX culture. Although we will not elaborate on the point here,
12 :     most of the precise design grew out of work done on logic programming
13 :     systems in the late 1980's and early 1990's.
14 :     <p>
15 :     To see how things work, let's consider a simple example:
16 :     <pre>
17 :     pegs 83333.1 | fid2dna | translate
18 :     </pre>
19 :     is a sequence of three commands that are strung together to get
20 :     translations of the genes in <i>Escherichia coli</i>. The way it
21 :     works is that
22 :     <ul>
23 :     <li><b>pegs</b> is a command that takes as arguments a list of genome
24 :     IDs (in this case we used only one - 83333.1). It writes out a stream
25 :     of the IDs of the PEGs (protein-encoding genes, often called CDSs) for
26 :     the set of genomes. You can see this by just typing
27 :     <pre>
28 :     pegs 83333.1 | head -5
29 :     </pre>
30 :     which will show the first 5 lines written by <b>pegs</b>.
31 :     <li> <b>fid2dna</b> is a utility that takes as input a stream of
32 :     feature IDs and outputs a FASTA formatted file of the DNA for the
33 :     features.
34 :     You might try
35 :     <pre>
36 :     pegs 83333.1 | fid2dna | head -50
37 :     </pre>
38 :     to see what comes out.
39 :     <li>Finally, <b>translate</b> is a utility that translates a file of
40 :     DNA sequences (in FASTA format) into a file of protein sequences in
41 :     FASTA format.
42 :     </ul>
43 :     This little example actually translates the DNA. The SEED includes
44 :     translations for each PEG. You could get them using
45 :     <pre>
46 :     pegs 83333.1 | translation_of
47 :     </pre>
48 :     As an exercise, we suggest that you run
49 :     <pre>
50 :     pegs 83333.1 | translation_of > tmp.SEED.translations
51 :     pegs 83333.1 | fid2dna | translate > tmp.translated.SEED.DNA
52 :     </pre>
53 :     and see how the two files compare. Would you expect them to be the
54 :     same? Why might they differ?
55 :    
56 :     <h2>A More Complex Example: Indirect Functional Coupling </h2>
57 :    
58 :     Functionally related genes tend to cluster on prokaryotic chromosomes.
59 :     This fact is the basis for a number of techniques that have been used
60 :     to gain clues relating to the function of "hypothetical proteins". As
61 :     the number of sequenced genomes grows, the power of techniques based
62 :     on statistical analysis of chromosomal clusters grows dramatically (I
63 :     have argued elsewhere that the information grows approximately as the
64 :     square of the number of genomes, but that is a story best left for
65 :     another document).
66 :     <p>
67 :     Inevitably, the question arises: <b>"What can one do with a gene when it
68 :     does not appear to be in a chromosomal cluster?"</b> The technique
69 :     that I will describe now has often been referred to as <i>indirect
70 :     coupling</i>.
71 :     Here is how it works:
72 :     <ol>
73 :     <li>
74 :     You start with a gene in some genome (prokaryotic or eukaryotic).
75 :     Call this gene <i>X</i>.
76 :     <li>
77 :     Compute all bi-directional best hits (BBHs) of <i>X</i>. These amount to
78 :     an attempt to find the corresponding genes in other genomes. You need
79 :     to remember that such projections are not completely reliable, but
80 :     they are at the heart of many of the inferences we make relating to
81 :     function.
82 :     Call this set of BBHs <i>S1</i>.
83 :     <li>
84 :     Now for each element of S1, we can check to see if it appears to be in
85 :     a chromosomal cluster, and if it is we can compute the genes that
86 :     appear to be <i>directly functional coupled</i>. Call this set of
87 :     genes that are fucntionally coupled to BBHs of <i>X</i> <i>S2</i>.
88 :     <li>
89 :     Now we want to locate the genes in the original genome that correspond
90 :     to genes in </S2>. Again, we estimate this correspondence using BBHs.
91 :     That is, we take BBHs of genes in <i>S2</i> that are in the same
92 :     genome as <i>X</i>. Call this set of genes <i>S3</i>. These are the
93 :     genes that are indirectly coupled to <i>X</i>.
94 :     <li>
95 :     Finally, look at the functions of the indirectly coupled genes. This
96 :     set of functions can often be hugely informative.
97 :     </ol>
98 :    
99 :     To illustrate this process and how it can be implemented using the
100 :     tools we provide with the SEED, let us go through the computation one
101 :     step at a time.
102 :    
103 :     The first step is to pick an initial gene. As an example, let me
104 :     choose a gene from <i>Methanocaldococcus jannaschii</i>, one of the first
105 :     genomes I studied. In the version of the SEED I am using at this
106 :     time, the peg is <i>fig|2190.1.peg.1537</i>. It is also known by a
107 :     number of other ids: <i>NP_248444.1</i>, <i>gi|15669631</i>,
108 :     <i>sp|Q58835</i>, and kegg|mja:MJ1440</i>, for example. You should
109 :     make a file containing just one line, and that line should contain one
110 :     of these ids. Call this file <i>initial.peg</i>.
111 :     Then
112 :     <pre>
113 :     bbhs < initial.peg > corresponding.genes.in.other.genomes
114 :     </pre>
115 :     should construct a file containing the BBHs of the single gene in the
116 :     file. If <i>initial.peg</i> had contained more genes, BBHs would have
117 :     been computed for all of the ids in the file.
118 :     <p>
119 :     You should make an initial file and run <b>bbhs</b> to make sure that
120 :     you understand this simple tool. If you wished to restrict yourself
121 :     to BBHs that had better similarity scores than, say, 1.0e-50, you
122 :     would use
123 :     <pre>
124 :     bbhs 1.0e-50 < initial.peg > corresponding.genes.in.other.genomes
125 :     </pre>
126 :     The tool <i>bbhs</i> computes BBHs for any id that occurs as a last
127 :     field in the tab-separated fields in lines from the input file. If there are numerous
128 :     PEGs in the input file, execution of <i>bbhs</i> can go for hours.
129 :     <p>
130 :     The next step involved computing genes that were functionally coupled
131 :     to those in <i>corresponding.genes.in.other.genomes</i>. This can be
132 :     done using
133 :    
134 :     <pre>
135 :     functionally_coupled 5000 1.0e-10 3 fast < corresponding.genes.in.other.genomes > coupled.in.other.genomes
136 :     </pre>
137 :     The tool <i>functionally_coupled</i> computes genes that are
138 :     functionally coupled (based on preserved contiguity) to those that
139 :     occur as last entries in the tab-separated fields of lines from the
140 :     input file. The parameters to the tool are as follows:
141 :     <ol>
142 :     <li>the maximum distance separating a gene from another that is
143 :     putatively clustered with it,
144 :     <li>
145 :     the maximum p-score for similarities used to find corresponding genes
146 :     between two clusters,
147 :     <li>
148 :     the minimum functional coupling score (which is one less than the number of
149 :     clusters in "significantly different" genomes), and
150 :     <li>
151 :     the word <i>fast</i> which causes the tool to attempt a fast technique
152 :     that sometimes misses couplings (this argument is optional). <i>Fast</i> is meant only relatively here:
153 :     execution can go for hours, even with the option set.
154 :     </ol>
155 :     The tool writes out 3-tuples containing the two functionally-coupled
156 :     genes separated by their coupling score.
157 :     <p>
158 :     Again, I urge you to actually go through this step and examine the
159 :     output file.
160 :     <p>
161 :     Once you have created the file <i>coupled.in.other.genomes</i>, it is
162 :     time to search for BBHs from the coupled genes back in the original
163 :     genome. This can be done using
164 :     <pre>
165 :     bbhs < coupled.in.other.genomes | restrict_to_genome initial.genome > restricted.to.initial.genome
166 :     </pre>
167 :     The <i>bbhs</i> tool is the same one we discussed above. The simple
168 :     tool <i>restrict_to_genome</i> takes a single argument which is a file
169 :     containing a list of genome ids. In this case the file contains just
170 :     a single line giving the id of the genome containing the initial gene
171 :     (which was, if you recall, 2190.1).
172 :     <p>
173 :     Finally, you should append the functions of the detected genes and
174 :     remove duplicates using
175 :     <pre>
176 :     function_of < restricted.to.initial.genome | sort -u > indirectly.coupled.in.original.genome
177 :     </pre>
178 :     The <i>function_of</i> tool takes the last field in each input line,
179 :     looks up the function (assuming the field was a PEG id), and writes
180 :     out the PEG id and function.
181 :     If you do all of this, you get a file containing
182 :     <pre>
183 :     fig|2190.1.peg.1170 Methanocaldococcus jannaschii Shikimate 5-dehydrogenase (EC 1.1.1.25)
184 :     fig|2190.1.peg.1551 Methanocaldococcus jannaschii 3-dehydroquinate dehydratase (EC 4.2.1.10)
185 :     fig|2190.1.peg.571 Methanocaldococcus jannaschii 3-phosphoshikimate 1-carboxyvinyltransferase (EC 2.5.1.19)
186 :     </pre>
187 :     For those of you who are familiar with aromatic amino acid
188 :     biosynthesis, these three functions do, in fact, perfectly bracket the
189 :     initial gene (fig|2190.1.peg.1537, which is the archael shikimate
190 :     kinase) in the pathway for chorismate biosynthesis. A number of us
191 :     partiucipated in predicting the function of this gene a number of
192 :     years ago. With the tools and data we have now, the predictions of function
193 :     for such genes is vastly simplified.
194 :    
195 :     You could have produced the three lines of output by stringing all of the
196 :     commands together using
197 :     <pre>
198 :     % bbhs < initial.peg |
199 :     > functionally_coupled 5000 1.0e-10 3 fast |
200 :     > bbhs |
201 :     > restrict_to_genomes initial.genome |
202 :     > function_of |
203 :     > sort -u > indirectly.coupled.in.original.genome
204 :     </pre>
205 :     So, there you have a pipeline of commands to compute indirect functional
206 :     coupling. You can adjust the similarities required for the BBHs, and
207 :     you can alter the minimum coupling score required, but that is
208 :     essentially the computation.
209 :     <p>
210 :     You might wish to play with the different parameters. In particular,
211 :     the "fast" parameter on <i>functionally_coupled</i> is optional, and
212 :     does alter the behaviour. How well it performs depends critically on
213 :     how many <i>pins</i> and <i>chromosomal clusters</i> have accurately
214 :     been precomputed. If you do not use the "fast" parameter, it can slow
215 :     things down by an order of magnitude. We do try to keep pins and
216 :     clusters in good shape, but ....
217 :     <p>
218 :     Finally, you could use <i>similar_to</i> rather than <i>bbhs</i>.
219 :     However, unless you proceed incrementally, checking all intermediate
220 :     results, this can produce a lot of false positives (i.e., a lot of
221 :     garbage).
222 :    
223 :     <h2>Locating New Forms of Enzymes </h2>
224 :    
225 :     Researchers have discovered that many enzymes have multiple,
226 :     nonhomologous forms. The best way to recognize such situations is by
227 :     constructing a <i>subsystem spreadsheet</i>. Once you have
228 :     constructed a spreadsheet for a subsystem, you often will find a set
229 :     of organisms in which the majority of functional roles can be matched
230 :     up with PEGs, but one or more columns remain empty. These often
231 :     represent cases in which an alternative form of the enzyme exists, and
232 :     one of the more interesting tasks in annotation is to search for the gene that
233 :     has not yet been identified as playing the missing role. In this
234 :     short section, we will illustrate one one to seek the missing gene
235 :     using the SEED data mining tools (you could, of course, also seek it
236 :     using the Web-based tools).
237 :     <p>
238 :     I will illustrate the technique using a search for the archaeal form
239 :     of the shikimate kinase. This example induces pleasant memories for
240 :     me, since this was one of the first bioinformatics predictions that I
241 :     was involved in that was then confirmed in the wet lab. At the time a
242 :     number of us worked out the prediction, our approachs were far from
243 :     systematic. Now it is time to formulate the approaches that
244 :     work and apply them routinely.
245 :     <p>
246 :     In this example, we had a known
247 :     version of the shikimate kinase, an enzyme used in aromatic
248 :     amino-acid biosynthesis. The known version occurs throughout the
249 :     bacteria and eukaryotes. However, while it was absolutely clear that
250 :     many archaea synthesize aromatic amino acids with essentially the same
251 :     pathway used by bacteria, locating the archaeal form of the shikimate
252 :     kinase had not yet been done. So, how might one go about locating the
253 :     archaeal form, given only a number of occurrences of the known
254 :     bacterial form?
255 :     <br>
256 :     The basic approach that we will illustrate here uses the following
257 :     steps:
258 :     <ul>
259 :     <li><b>Step 1</b>: You begin the approach using the known version of the
260 :     enzyme (any of the known versions). We will take a version from
261 :     <i>Brucella abortus</i>, fig|235.1.peg.189, but any of the bacterial
262 :     occurrences would work. We will call this the <i>initial
263 :     sequence</i>. We will begin by computing bi-directional best hits
264 :     (BBHs) of the initial sequence. Call the set composed of the initial
265 :     sequence and all of its BBHs, <b>Si: the set of known versions of the
266 :     enzyme</b>. Here is how we propose computing <b>Si</b>:
267 :     <pre>
268 :     bbhs iterate < initial > bbhs
269 :     cat initial bbhs > Si
270 :     </pre>
271 :     The command <i>bbhs</i> was used in our last example. In this case,
272 :     we have added an extra argument; by saying <i>iterate</i> we have
273 :     requested that the command compute BBHs of the PEGs in the file
274 :     initial, but then iterate by adding BBHs of any of the newly added
275 :     PEGs (and continue until no more new BBHs can be added).
276 :    
277 :     <li><b>Step 2</b>: Now we compute the set of genes that are
278 :     functionally coupled to the known version using
279 :    
280 :     <pre>
281 :     functionally_coupled 5000 1.0e-10 5 fast < Si > fc.Si
282 :     bbhs length=0.8 iterate < fc.Si > known.machinery
283 :     </pre>
284 :    
285 :     The genes in <b>fc.Si</b> should all occur in genomes that contain the
286 :     known version of the enzyme. This set represents "genes coupled to
287 :     the known version" and represents an estimate of the machinery that is
288 :     functionally related to the role played by the enzyme. If we then
289 :     project this machinery to all of the genomes (including those that do
290 :     not contain the known version) using <i>bbhs</i> we have the file <i>known.machinery</i>.
291 :    
292 :     <li><b>Step 3</b>: We then proceed by restricting the PEGs in <i>known.machinery</i> to those from
293 :     genomes that do not contain members of <i>Si</i>; that is, to those that might be from
294 :     genomes containing the alternative form. To do this we use
295 :     <pre>
296 :     genomes_of < Si > genomes.of.Si
297 :     restrict_to_genomes -v genomes.of.Si < known.machinery > known.in.genomes.without.orig.form
298 :     </pre>
299 :     The tool <i>genome_of</i> takes as input a file containing lines that end with PEG ids. It
300 :     outputs lines that contain the genome ids of all genomes containing those PEGs (i.e., those that occur
301 :     in the last column of the input file).
302 :     The tool <i>restrict_to_genomes</i> takes as input a file containing PEG ids in the
303 :     last column. It restricts this input set based on the genome ids of these PEGs, writing out those lines
304 :     that pass a test. The test is given by the command line arguments. If the command line contains a single
305 :     argument, it should be a file containing genome ids. In the case, the test amounts to "keep PEGs from
306 :     genomes corresponding to those in the given file". If the command line arguments are of the form
307 :     <i>-v File</>, then the test amounts to "keep only PEG ids from genomes other than those in the given file".
308 :    
309 :     <li><b>step 4</b>: Now that we have a file on known machinery in genomes that probably do not have the original
310 :     form of the enzyme, we use
311 :     <pre>
312 :     functionally_coupled 5000 1.0e-10 5 fast < known.in.genomes.without.orig.form > potential.candidates
313 :     </pre>
314 :     to get genes that are functionally coupled to PEGs in <i>known.in.genomes.without.orig.form</i>. The trouble
315 :     with these "potential candidates" is that the file will contain instances of all components of the
316 :     machinery of the subsystem; what we really want is instances that bear no resemblance to either the original
317 :     form or to any other gene that was functionally related to the original form (or to any BBH of those!).
318 :     We achive this restriction using
319 :     <pre>
320 :     cat Si fc.Si known.machinery > things.to.ignore
321 :     restrict_by_similarity -v things.to.ignore 1.0e-3 < potential.candidates > candidates
322 :     </pre>
323 :     The <i>restrict_by_similarity</i> command can be used to restrict a set of PEGs to those that are
324 :     either similar to, or definitely not similar to, a set of given PEGs (in this case <i>things.to.ignore</i>).
325 :     In the invocation shown, we take those not similar to genes in <i>potential.candidates</i>, and we specify a similarity threshhold of 1.0e-3.
326 :     <li><b>Step 5</b>: Finally, we take the set of candidate pegs and cluster all that are BBHs of one another.
327 :     This gives us a sequence of clusters of BBHs, where all of the entries in each cluster were dredged up by
328 :     this fairly simple little procedure. The command to produce the clusters is as follows:
329 :     <pre>
330 :     cluster_by_bbhs < candidates > data.to.look.at
331 :     </pre>
332 :     This concludes my initial discussion of what I call the "sniffer algorithm" (with the image of a bloodhound in mind).
333 :     It is, perhaps, worth noting that we base this exploration almost entirely on projection (BBHs) and functional coupling on the chromosome. If one wanted to include use of other forms of evidence of functional coupling, it can
334 :     be done easily. For example, to add fusion data, one would simply change
335 :    
336 :     <pre>
337 :     functionally_coupled 5000 1.0e-10 5 fast < known.in.genomes.without.orig.form > potential.candidates
338 :     </pre>
339 :    
340 :     to
341 :     <pre>
342 :     functionally_coupled 5000 1.0e-10 5 fast < known.in.genomes.without.orig.form > potential.candidates
343 :     fusion_of < known.in.genomes.without.orig.form >> potential.candidates
344 :     </pre>

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3