1 :     <h1>Backing Up Your Data</h1>
2 :     The data and code stored within the SEED are organized as follows:
3 :     <pre>
4 :     ~fig                      on a Mac: /Users/fig; on Linux: /home/fig
5 :         FIGdisk
6 :             dist              source code
7 :             FIG
8 :                 Tmp           temporary files
9 :                 Data          data in readable form
10 :     </pre>
11 :     <br>
12 :     <br>
13 :     <ol>
14 :     <li>
15 :     The directory <b>FIGdisk</b> holds both the code and data for the
16 :     SEED. The data is loaded into a database system that stores the data
17 :     in a location external to FIGdisk, but otherwise a running SEED is
18 :     encapsulated within FIGdisk. A symbolic link to FIGdisk is maintained
19 :     in the directory ~fig.
20 :     <br>
21 :     <li>
22 :     Within FIGdisk there are two key directories:
23 :     <br>
24 :     <br><ol><li>
25 :     <b>dist</b> contains the source code, and
26 :    
27 :     <li>
28 :     <b>FIG</b> contains the execution environment and Data.
29 :     </ol>
30 :     <br>
31 :     <li>
32 :     Within FIG, there are a number of directories. The most important are
33 :     <br>
34 :     <br>
35 :     <ol>
36 :     <li>
37 :     <b>Data</b>, which contains all of the data in a human-readable form,
38 :     and
39 :     <br>
40 :     <br>
41 :     <li>
42 :     <b>Tmp</b>, which contains the temporary files built by SEED in
43 :     response to commands.
44 :     </ol>
45 :     </ol>
46 :     <br>
47 :     Hence, to back up your data, you should simply copy the Data
48 :     directory. It should be backed up to a separate disk. Suppose that
49 :     /Volumes/Backup is a backup disk. Then,
50 :     <br>
51 :     <pre>
52 :     cp -pRP /Users/fig/FIGdisk/FIG/Data /Volumes/Backup/Data.Backup
53 :     gzip -r /Volumes/Backup/Data.Backup
54 :     </pre>
55 :     <br>
56 :     would be a reasonable way to make a backup. The copy preserves
57 :     permissions (-p), copies recursively (-R), and does not follow symbolic links (-P);
57 :     gzip -r then compresses each file in the copy in place.
58 :     <br>
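<p>The backup commands above can be wrapped in a small script. What follows is only a sketch: <tt>backup_seed_data</tt> is a hypothetical helper (not part of the SEED), and the dated <tt>Data.Backup.YYYYMMDD</tt> name is an assumption chosen so that repeated backups do not overwrite each other.</p>

```shell
# Sketch only: backup_seed_data is a hypothetical helper, not a SEED command.
# Usage: backup_seed_data <Data directory> <backup disk directory>
backup_seed_data() {
    src=$1          # e.g. /Users/fig/FIGdisk/FIG/Data
    dest_root=$2    # e.g. /Volumes/Backup
    dest="$dest_root/Data.Backup.$(date +%Y%m%d)"   # assumed naming scheme

    if [ ! -d "$src" ]; then
        echo "no such directory: $src" >&2
        return 1
    fi
    if [ -e "$dest" ]; then
        echo "refusing to overwrite: $dest" >&2
        return 1
    fi

    # Preserve attributes (-p), recurse (-R), do not follow symbolic links (-P).
    cp -pRP "$src" "$dest" || return 1

    # Compress every regular file in the copy in place.
    # gzip documents exit status 2 as "warning" (for example, a skipped
    # symbolic link), so only other non-zero statuses are treated as failure.
    rc=0
    gzip -r "$dest" || rc=$?
    if [ "$rc" -ne 0 ] && [ "$rc" -ne 2 ]; then
        return 1
    fi

    # Print the location of the finished backup.
    echo "$dest"
}
```

<p>For example, <tt>backup_seed_data /Users/fig/FIGdisk/FIG/Data /Volumes/Backup</tt> would print the path of the new backup directory on success.</p>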
59 :     <h1>Copying a Version of the SEED</h1>
60 :    
61 :     To make a second copy of the SEED (either for a friend or for yourself), you should use tar
62 :     to preserve a few symbolic links (which are relative, not absolute; this means that they can
63 :     be copied while still preserving the integrity of the whole system).
64 :     So, suppose that you have a FIGdisk in /Volumes/From/FIGdisk.Jan8 and you wish to copy it
65 :     to /Volumes/To. Use
66 :     <pre>
67 :     cd /Volumes/From
68 :     tar cf - FIGdisk.Jan8 | (cd /Volumes/To; tar xf -)
69 :     </pre>
70 :     <p>This should produce the desired copy. In this case, suppose that we are in a
71 :     Mac OS X
72 :     environment, and <b>From</b> and <b>To</b> are FireWire disks. To install the system on a friend's
73 :     Mac, you would unmount <b>To</b>, plug it into the new machine, and then set the symbolic link to the active
74 :     FIGdisk using
75 :     <br>
76 :     </p>
77 :     <table border="1" bgcolor="#CCCCCC">
78 :     <tr>
79 :     <td width="403"><font face="Courier New, Courier, mono">cd ~fig</font></td>
80 :     <td width="285">&nbsp;</td>
81 :     </tr>
82 :     <tr>
83 :     <td><font face="Courier New, Courier, mono">rm FIGdisk</font></td>
84 :     <td># fails if there is no existing FIGdisk on the machine</td>
85 :     </tr>
86 :     <tr>
87 :     <td><font face="Courier New, Courier, mono">ln -s /Volumes/To/FIGdisk.Jan8 FIGdisk</font></td>
88 :     <td>&nbsp;</td>
89 :     </tr>
90 :     <tr>
91 :     <td><font face="Courier New, Courier, mono">cd FIGdisk</font></td>
92 :     <td>&nbsp;</td>
93 :     </tr>
94 :     <tr>
95 :     <td><font face="Courier New, Courier, mono">cp CURRENT_RELEASE DEFAULT_RELEASE</font></td>
96 :     <td># Causes the new configuration to use the code that was running in the
97 :     original installation</td>
98 :     </tr>
99 :     </table>
100 :     <p>At this point, the newly-copied FIGdisk can be configured for use. The full
101 :     documentation for SEED installation can currently be found at the following
102 :     location in the SEED Wiki: </p>
103 :     <blockquote>
104 :     <p><a href="http://www-unix.mcs.anl.gov/SEEDWiki/moin.cgi/SeedInstallationInstructions"> http://www-unix.mcs.anl.gov/SEEDWiki/moin.cgi/SeedInstallationInstructions</a></p>
105 :     </blockquote>
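<p>The copy and relinking steps in this section can be sketched as two shell functions. Both names (<tt>copy_figdisk</tt> and <tt>activate_figdisk</tt>) are hypothetical; they only wrap the <tt>tar</tt>, <tt>rm</tt>, <tt>ln</tt>, and <tt>cp</tt> commands shown earlier, and the extra guard against replacing a real directory is an added assumption.</p>

```shell
# Sketch only: both functions are hypothetical wrappers, not SEED commands.

# copy_figdisk <source parent dir> <FIGdisk dir name> <destination parent dir>
copy_figdisk() {
    from_dir=$1   # e.g. /Volumes/From
    name=$2       # e.g. FIGdisk.Jan8
    to_dir=$3     # e.g. /Volumes/To

    if [ ! -d "$from_dir/$name" ] || [ ! -d "$to_dir" ]; then
        echo "source or destination directory is missing" >&2
        return 1
    fi
    # tar archives symbolic links as links, so relative links survive the copy.
    (cd "$from_dir" && tar cf - "$name") | (cd "$to_dir" && tar xf -)
}

# activate_figdisk <fig home directory> <path to the copied FIGdisk>
activate_figdisk() {
    home_dir=$1   # e.g. /Users/fig
    figdisk=$2    # e.g. /Volumes/To/FIGdisk.Jan8

    # Added safety check: never replace a real directory, only a link.
    if [ -e "$home_dir/FIGdisk" ] && [ ! -L "$home_dir/FIGdisk" ]; then
        echo "$home_dir/FIGdisk exists and is not a symbolic link" >&2
        return 1
    fi
    rm -f "$home_dir/FIGdisk"    # -f: no error if there is no existing link
    ln -s "$figdisk" "$home_dir/FIGdisk" || return 1

    # Use the code release that was running in the original installation.
    cp "$home_dir/FIGdisk/CURRENT_RELEASE" "$home_dir/FIGdisk/DEFAULT_RELEASE"
}
```

<p>With the paths used in this section, the calls would be <tt>copy_figdisk /Volumes/From FIGdisk.Jan8 /Volumes/To</tt> on the source machine and <tt>activate_figdisk /Users/fig /Volumes/To/FIGdisk.Jan8</tt> on the new one.</p>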
106 :     <h1>Running Multiple Copies of the SEED</h1>
107 :    
108 :     For individual users who use the SEED to support comparative analysis, a single copy is completely
109 :     adequate. Adding genomes can usually be done without disrupting normal use, and a very occasional major
110 :     reorganization that runs over the weekend is not a big deal.
111 :     <p>
112 :     The situation is somewhat different when the system is being used to support a major sequencing/annotation
113 :     effort. Here you have a user community that is sensitive to disruptions of service, and you
114 :     have frequent demands to update versions of data. It is then best to have two systems: the
115 :     <b>production system</b> is used to support the larger user community, and the <b>update system</b> is
116 :     used to prepare updated versions of the system. Even so, work stoppages of 2-5 hours will occur when
117 :     new releases are swapped in. To swap in new data from the update system to the production system,
118 :     you need to
119 :     <ol>
120 :     <li>stop all work on the production machine,
121 :     <li>do a peer-to-peer update from the production machine to the update machine to
122 :     capture all annotations and assignments,
123 :     <li> move the Data directory on the production machine to a backup location,
124 :     <li> move in a copy of the update Data directory, and
125 :     <li> run
126 :     <pre>
127 :     fig load_all
128 :     </pre>
129 :     to reload the production databases with the data from the newly inserted Data directory.
130 :     This will usually take several hours.
131 :     <li> make the production machine available for use.
132 :     </ol>
133 :     Our experience is that anytime a group wishes to share a common production environment,
134 :     this 2-system approach is the way to do it.
135 :     <br>
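<p>The three mechanical steps of the swap (moving the old Data directory aside, copying in the update Data directory, and reloading) can be sketched as follows. This is a sketch under stated assumptions: <tt>swap_in_data</tt>, the <tt>Data.old</tt> name, and the <tt>SEED_LOAD_CMD</tt> override are all hypothetical; only <tt>fig load_all</tt> comes from the procedure above. Stopping work, the peer-to-peer update, and reopening the machine remain manual steps.</p>

```shell
# Sketch only: swap_in_data, Data.old and SEED_LOAD_CMD are hypothetical.
# Usage: swap_in_data <production FIG dir> <update Data dir> <backup dir>
swap_in_data() {
    prod_fig=$1      # e.g. /Users/fig/FIGdisk/FIG on the production machine
    update_data=$2   # Data directory prepared on the update system
    backup_dir=$3    # where the old production Data directory is kept

    if [ ! -d "$prod_fig/Data" ] || [ ! -d "$update_data" ] || [ ! -d "$backup_dir" ]; then
        echo "a required directory is missing" >&2
        return 1
    fi
    if [ -e "$backup_dir/Data.old" ]; then
        echo "refusing to overwrite $backup_dir/Data.old" >&2
        return 1
    fi

    # Move the production Data directory to a backup location.
    mv "$prod_fig/Data" "$backup_dir/Data.old" || return 1

    # Move in a copy of the update Data directory.
    cp -pRP "$update_data" "$prod_fig/Data" || return 1

    # Reload the production databases; this usually takes several hours.
    # SEED_LOAD_CMD exists only so the sketch can be tried without a SEED.
    ${SEED_LOAD_CMD:-fig load_all}
}
```

<p>Keeping the old directory under a separate name means the previous release can be moved back if the reload fails.</p>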
136 :     <h1>Adding a New Genome to an Existing SEED</h1>
137 :     To add a new genome to a running SEED is fairly easy, but there are a
138 :     number of details that do have to be handled with care.
139 :     <p>
140 :     The first thing to note is that the SEED does not include tools to call genes -- you are expected
141 :     to provide gene calls. This may change at some point, but for now you must call your own genes. A
142 :     number of good tools now exist in the public domain, and you will need to find one that seems adequate
143 :     for your needs.
144 :     <p>
145 :     Let us now
146 :     cover how to prepare the actual data. You need to construct a directory (in somewhere like ~fig/Tmp)
147 :     of the following form:
148 :     <br>
149 :     <table width="100%">
150 :     <tr>
151 :     <td><tt>GenomeId</tt></td>
152 :     <td></td>
153 :     <td></td>
154 :     <td></td>
155 :     <td>of the form xxxx.y where xxxx is the taxon ID and y is an integer</td>
156 :     </tr>
157 :    
158 :     <tr>
159 :     <td></td>
160 :     <td><tt>PROJECT</tt></td>
161 :     <td></td>
162 :     <td></td>
163 :     <td> a file containg a description of the source of the data</td>
164 :     </tr>
165 :    
166 :     <tr>
167 :     <td></td>
168 :     <td><tt>GENOME</tt></td>
169 :     <td></td>
170 :     <td></td>
171 :     <td>a file containing a single line identifying the genus, species and strain</td>
172 :     </tr>
173 :    
174 :     <tr>
175 :     <td></td>
176 :     <td><tt>TAXONOMY</tt></td>
177 :     <td></td>
178 :     <td></td>
179 :     <td>a file containing a single line containing the NCBI taxonomy</td>
180 :     </tr>
181 :    
182 :     <tr>
183 :     <td></td>
184 :     <td><tt>RESTRICTIONS</tt></td>
185 :     <td></td>
186 :     <td></td>
187 :     <td>a file containing a description of distribution restrictions (optional)</td>
188 :     </tr>
189 :    
190 :     <tr>
191 :     <td></td>
192 :     <td><tt>CONTIGS</tt></td>
193 :     <td></td>
194 :     <td></td>
195 :     <td>contigs in fasta format</td>
196 :     </tr>
197 :    
198 :     <tr>
199 :     <td></td>
200 :     <td><tt>assigned_functions</tt></td>
201 :     <td></td>
202 :     <td></td>
203 :     <td>function assignments for the protein-encoding genes (optional)</td>
204 :     </tr>
205 :    
206 :     <tr>
207 :     <td></td>
208 :     <td><tt>Features</tt></td>
209 :     </tr>
210 :    
211 :     <tr>
212 :     <td></td>
213 :     <td></td>
214 :     <td><tt>peg</tt></td>
215 :     </tr>
216 :    
217 :     <tr>
218 :     <td></td>
219 :     <td></td>
220 :     <td></td>
221 :     <td><tt>tbl</tt></td>
222 :     <td>describes locations and aliases for the protein-encoding genes</td>
224 :     </tr>
225 :    
226 :     <tr>
227 :     <td></td>
228 :     <td></td>
229 :     <td></td>
230 :     <td><tt>fasta</tt></td>
231 :     <td>fasta file of translations of the protein-encoding genes</td>
233 :     </tr>
234 :    
235 :     <tr>
236 :     <td></td>
237 :     <td></td>
238 :     <td><tt>rna</tt></td>
239 :     </tr>
240 :    
241 :     <tr>
242 :     <td></td>
243 :     <td></td>
244 :     <td></td>
245 :     <td><tt>tbl</tt></td>
246 :     <td>describes locations and aliases for the rna-encoding genes</td>
248 :     </tr>
249 :    
250 :     <tr>
251 :     <td></td>
252 :     <td></td>
253 :     <td></td>
254 :     <td><tt>fasta</tt></td>
255 :     <td>fasta file of the DNA corresponding to the genes</td>
257 :     </tr>
258 :    
259 :    
260 :     </table>
261 :    
292 :     <br>
293 :     <br>
294 :     Let us expand on this very brief description:
295 :     <ol>
296 :     <li>
297 :     The name of the directory must be of the form xxxx.y where xxxx is the
298 :     taxon ID, and y is a sequence number. For example, 562.1 might be
299 :     used for <i>E. coli</i>, since 562 is the NCBI taxon ID for
300 :     <i>Escherichia coli</i>. The sequence number (y) is used to
301 :     distinguish multiple genomes having the same taxon ID.
302 :     <br><br>
303 :     <li>
304 :     The assigned_functions file contains assignments of function for the
305 :     protein-encoding genes. Each line is of the form
306 :     <pre>
307 :     Id\tFunction\tConfidence (\t stands for a tab character)
308 :     </pre>
309 :     The Id must be a valid PEG Id. These are of the form:
310 :     <pre>
311 :     fig|xxxx.y.peg.z
312 :     </pre>
313 :     where xxxx.y is the genome Id, and z is an integer that uniquely distinguishes
314 :     the peg (protein-encoding gene).
315 :     <br>
316 :     <i>Confidence</i> is a single character code:
317 :     <br>
318 :     <ul>
319 :     <li>a space for "normal"
320 :     <li>w for "weak"
321 :     <li>e for experimentally verified
322 :     <li>s for "strong evidence (but not experimental)"
323 :     </ul>
324 :     The second tab and the confidence code can be omitted (it will default to a space).
325 :     The assigned_functions file is optional. You can leave it blank and, after adding the genome
326 :     to the SEED, ask for automated assignments.
327 :     <br><br>
328 :     <li>
329 :     The tbl files specify the locations of genes, as well as any aliases. Each line in a tbl line
330 :     is of the form
331 :     <br>
332 :     <pre>
333 :     Id\tLocation\tAliases (the aliases are separated by tabs)
334 :     </pre>
335 :     The Id must conform to the fig|xxxx.y.peg.z format described above. The <i>Location</i> is of the form
336 :     <br>
337 :     <pre>
338 :     L1,L2,L3...Ln
339 :    
340 :     where each Li describes a region on a contig and is of the form
341 :    
342 :     <i>Contig_Begin_End</i> where
343 :    
344 :     Contig is the Id of the contig,
345 :     Begin is the position of the first character, and
346 :     End is the position of the last character
347 :     </pre>
348 :     <ul>
349 :     <li>if Begin > End, the region being described is on the complementary strand, and
350 :     <li>the End position is the last character preceding the stop codon (i.e., the region
351 :     corresponding to a protein-encoding gene is thought of as including all bases from the
352 :     first base of the start codon to the last base before the stop codon.
353 :     </ul>
354 :     For example,
355 :     <pre>
356 :     fig|562.1.peg.15 Escherichia_coli_K12_14168_15295 dnaJ b0015 sp|P08622 gi|16128009
357 :     </pre>
358 :     describes the <i>dnaJ</i> gene encoded on the positive strand from 14168 through 15295 on the contig Escherichia_coli_K12.
359 :     The gene is from the genome 562.1, and it has 4 specified aliases.
360 :     <li>
361 :     The fasta files must have gene Ids that match tbl file entries. The <i>peg</i> fasta file contains translations,
362 :     while the <i>rna</i> fasta file contains DNA sequences.
363 :     <li>
364 :     Both the <i>peg</i> and the <i>rna</i> subdirectories are optional.
365 :     </ol>
366 :     <br>
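<p>The Id and location formats described in the list above can be checked mechanically. The functions below are a minimal sketch with hypothetical names (they are not SEED utilities). They assume that taxon IDs, sequence numbers, and positions are plain decimal integers, and they split each region from the right because contig Ids may themselves contain underscores.</p>

```shell
# Sketch only: these helper names are hypothetical, not SEED utilities.

# Succeeds if $1 has the form xxxx.y (taxon ID, dot, sequence number).
is_genome_id() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+$'
}

# Succeeds if $1 has the form fig|xxxx.y.peg.z.
is_peg_id() {
    printf '%s\n' "$1" | grep -Eq '^fig\|[0-9]+\.[0-9]+\.peg\.[0-9]+$'
}

# Succeeds if $1 is a non-empty string of decimal digits.
is_uint() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Split one Contig_Begin_End region and print: contig begin end strand length
parse_region() {
    region=$1
    end=${region##*_}       # text after the last underscore
    rest=${region%_*}       # everything before the last underscore
    begin=${rest##*_}       # text after the second-to-last underscore
    contig=${rest%_*}       # remainder is the contig Id

    if [ -z "$contig" ] || [ "$contig" = "$rest" ] || ! is_uint "$begin" || ! is_uint "$end"; then
        echo "bad region: $region" >&2
        return 1
    fi
    if [ "$begin" -gt "$end" ]; then
        strand=-                        # Begin > End: complementary strand
        length=$((begin - end + 1))
    else
        strand=+
        length=$((end - begin + 1))
    fi
    echo "$contig $begin $end $strand $length"
}

# Split a full location L1,L2,...,Ln and print one line per region.
parse_location() {
    (
        IFS=,
        for r in $1; do
            parse_region "$r" || exit 1
        done
    )
}
```

<p>For the <i>dnaJ</i> example above, <tt>parse_region Escherichia_coli_K12_14168_15295</tt> prints <tt>Escherichia_coli_K12 14168 15295 + 1128</tt>; the length of 1128 bases corresponds to 376 codons.</p>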
367 :     The SEED provides a utility that can be used to produce such a directory from a GenBank entry. Thus,
368 :     <br>
369 :     <pre>
370 :     parse_genbank 562.4 ~/Tmp/562.4 < genbank.entry.for.a.new.E.coli.genome
371 :     </pre>
372 :     would attempt to produce a properly formatted directory (~/Tmp/562.4) containing
373 :     the data encoded in the GenBank entry from the file <i>genbank.entry.for.a.new.E.coli.genome</i>.
374 :     This script is far from perfect, and there is huge variance in encodings in GenBank
375 :     files, so use it at your own risk and check the output manually.
376 :     <p>
377 :     You would be well advised to look at some of the subdirectories included in the FIGdisk/FIG/Data/Organisms directory
378 :     to see examples of how it should be done.
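<p>Before adding a genome, it may help to check a prepared directory against the layout described above. The function below is a rough sketch, not a SEED tool: <tt>check_genome_dir</tt> is a hypothetical name, it accepts the contigs file in either upper or lower case (an assumption; compare with an existing Organisms subdirectory), and it assumes that a <tt>peg</tt> or <tt>rna</tt> subdirectory, when present, should hold both <tt>tbl</tt> and <tt>fasta</tt>.</p>

```shell
# Sketch only: check_genome_dir is a hypothetical helper, not a SEED command.
# Prints one line per problem found and returns non-zero if there were any.
check_genome_dir() {
    dir=$1
    rc=0

    # The directory name must be of the form xxxx.y.
    id=$(basename "$dir")
    if ! printf '%s\n' "$id" | grep -Eq '^[0-9]+\.[0-9]+$'; then
        echo "directory name is not of the form xxxx.y: $id"
        rc=1
    fi

    # Files that the layout does not mark as optional.
    for f in PROJECT GENOME TAXONOMY; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing file: $f"
            rc=1
        fi
    done
    # Assumption: accept either spelling of the contigs file.
    if [ ! -f "$dir/CONTIGS" ] && [ ! -f "$dir/contigs" ]; then
        echo "missing file: CONTIGS"
        rc=1
    fi

    # GENOME and TAXONOMY are described as single-line files.
    for f in GENOME TAXONOMY; do
        if [ -f "$dir/$f" ] && [ "$(wc -l < "$dir/$f" | tr -d ' ')" -gt 1 ]; then
            echo "more than one line in: $f"
            rc=1
        fi
    done

    # peg and rna are optional; if present, expect both tbl and fasta.
    for kind in peg rna; do
        if [ -d "$dir/Features/$kind" ]; then
            for f in tbl fasta; do
                if [ ! -f "$dir/Features/$kind/$f" ]; then
                    echo "missing file: Features/$kind/$f"
                    rc=1
                fi
            done
        fi
    done

    return $rc
}
```

<p>A call such as <tt>check_genome_dir /Users/fig/Tmp/562.4</tt> prints nothing and succeeds when the basic layout is in place. It does not check file contents, so the output of <tt>parse_genbank</tt> still needs a manual review.</p>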
379 :     <p>
380 :     So, supposing that you have built a valid directory (say, <i>/Users/fig/Tmp/562.4</i>), you can add the genome using
381 :     <pre>
382 :     fig add_genome /Users/fig/Tmp/562.4
383 :     </pre>
384 :     <br>
385 :     The <i>add_genome</i> request will add your new genome and queue a computational request that similarities
386 :     be computed for the protein-encoding genes.
387 :    
388 :     <h1>Computing Similarities</h1>
389 :    
390 :     Adding a genome does not automatically get similarities computed for the new genome; it queues the request.
391 :     To get the similarities actually computed, you need to establish a computational environment on which
392 :     the blast runs will be made, and then initiate a request on the machine running the SEED.
393 :     <p>
394 :     This is not a completely trivial process because there are a variety of different ways to compute
395 :     similarities:
396 :     <ol>
397 :     <li> You can just compute them on the system running the SEED. This can take several days, but this
398 :     is often a perfectly reasonable way to get the job done.
399 :     <li>Alternatively, you may be in an environment where you have a set of networked machines (say, 4-5 machines),
400 :     and you wish to just exploit these machines to do the blast runs.
401 :     <li> Finally, you may be dealing with a large genome or genomes (and, hence, the need for many days of computation).
402 :     In this case, it makes sense to utilize a large computational resource, and this resource may either
403 :     be a local cluster or a service provided over the net.
404 :     </ol>
405 :     <br>
406 :     To establish the flexibility needed to support all of these alternatives, we implemented the following
407 :     approach:
408 :     <ul>
409 :     <li>
410 :     The user can describe one or more <b>similarity computational environments</b>
411 :     in a configuration file called <i>similarities.config</i>. The details of this encoding
412 :     are beyond the scope of this document.
413 :     These environments all represent potential ways to compute similarities.
414 :     <br>
415 :     <li>
416 :     When a SEED systems administrator (usually the normal SEED user) wishes to run similarities,
417 :     they run the <b>generate_similarities</b> command with two parameters: the first names a
418 :     similarity computational environment, and the second specifies whether automated assignments
419 :     should be computed as the similarity results are installed. All the queued similarity requests
420 :     are then batched up and sent off to the specified server (which may simply be the same machine).
421 :     As the similarities complete, they are installed automatically. Further, if a set of similarities
422 :     arrives for a protein-encoding gene that has no current assignment of function, an automated
423 :     assignment is computed when the second parameter requests it. For example,
426 :     <pre>
427 :     generate_similarities local auto-assignments
428 :     </pre>
429 :     specifies a similarity computational environment labeled <i>local</i>, which presumably means "run the blast
430 :     requests on this machine", and requests automated assignments for all protein-encoding genes that currently either
431 :     have no assigned function or have an assigned function that is "hypothetical".
432 :     </ul>
433 :     <br>
434 :    
435 :     We anticipate that at least one major center (Argonne National Lab) and, perhaps, more will create well-defined
436 :     interfaces for handling high-volume requests. At FIG, we will maintain a set of instructions on how to set up
437 :     your configuration to exploit these resources.
438 :    
439 :     <h1>Deleting Genomes from a Version of the SEED </h1>
440 :    
441 :     There are two common instances in which one wishes to delete genomes from a running version of the SEED: one is
442 :     when you wish to replace an existing version of a genome (in which case the replacement is viewed as first
443 :     deleting the existing copy and then adding the new copy), and the second is when you wish to create a copy
444 :     of the SEED containing a subset of the entire collection of genomes.
445 :     <p>
446 :     To delete a set of genomes from a running version of the SEED, just use
447 :     <pre>
448 :     fig delete_genomes G1 G2 ...Gn (where G1 G2 ... Gn designates a list of genomes)
449 :     </pre>
450 :     For example,
451 :     <pre>
452 :     fig delete_genomes 562.1
453 :     </pre>
454 :     could be used to delete a single genome with a genome ID of 562.1.
455 :     <p>
456 :     Making a copy with some genomes deleted, to give to someone else, requires a slightly different approach.
457 :     To extract a set of genomes from an existing version of the SEED, you need to run the command
458 :     <pre>
459 :     extract_genomes Which ExistingData ExtractedData
460 :     </pre>
461 :    
462 :     The first argument is either the word "unrestricted" or the name of a file containing a list of
463 :     genome IDs (the genomes that are to be retained in the extraction). The second argument is
464 :     the path to the current Data directory. The third argument specifies the name of a directory
465 :     that is created holding the extraction. Thus,
466 :     <pre>
467 :     extract_genomes unrestricted /Users/fig/FIGdisk/FIG/Data /Volumes/Tmp/ExtractedData
468 :     </pre>
469 :     would create the extracted Data directory for you. If you wish to then produce a fully distributable
470 :     version of the SEED from the existing version and the extracted Data directory, you would
471 :     use
472 :     <pre>
473 :     make_a_SEED /Users/fig/FIGdisk /Volumes/Tmp/ExtractedData /Volumes/MyFriend/FIGdisk.ReadyToGo
474 :     rm -rf /Volumes/Tmp/ExtractedData
475 :     # NOTE: at the time of writing, make_a_SEED had not yet been written
476 :     </pre>
477 :    
478 :     <h1>Periodic Reintegration of Similarities</h1>
479 :    
480 :     When the initial SEED was constructed, similarities were computed. For most similarities of the form
481 :     "Id1 and Id2 are similar", entries were "recorded" for both Id1 and Id2. This is not always true,
482 :     since we truncate the number of similarities associated with any single Id (leaving us in a situation
483 :     in which we may have similarity recorded for Id1, but not Id2). When a genome is added, if Id1 was an added
484 :     protein-encoding gene (peg), then the similarity is "recorded" for Id1 but not Id2. This means that when looking
485 :     at genes from previously existing organisms, you never get links back to the added pegs. This is not totally
486 :     satisfactory.
487 :     <p>
488 :     Periodically, it is probably a good idea to "reintegrate the similarities". This can be done by
489 :     just running
490 :     <pre>
491 :     reintegrate_sims
492 :     # update_sims /dev/null /dev/null ~/FIGdisk/FIG/Data/NewSims/* ; rm -f ~/FIGdisk/FIG/Data/NewSims/* index_sims
493 :     </pre>
494 :     The job will probably run for quite a while (perhaps as much as a day or two).
495 :    
496 :     <h1>Computing "Pins" and "Clusters"</h1>
497 :    
498 :     The SEED displays potentially significant clusters on prokaryotic chromosomes. In the
499 :     process of finding preserved contiguity, it computes "pins", which are simply a set of genes
500 :     that are believed to be orthologs that cluster with similar genes. If you add your own genome,
501 :     you will probably want to compute and enter these into the active database. This can be done
502 :     using
503 :     <pre>
504 :     compute_pins_and_clusters G1 G2 G3 ...
505 :     </pre>
506 :     where the arguments are genome Ids. Thus,
507 :     <pre>
508 :     compute_pins_and_clusters 562.4
509 :     </pre>
510 :     would compute and add entries for all of the <i>pegs</i> in genome 562.4.
