
Annotation of /FigTutorial/SEED_administration_issues.html



Revision 1.14

1 : olson 1.2 <h1>SEED Administration</h1>
2 : gdpusch 1.8
3 :     <p>
4 :     This tutorial discusses a number of issues that you will need to know about
5 :     in order to install, share, and maintain your SEED installation.
6 :     It is organized as follows:
7 :     </p>
8 :    
9 :     <ul>
10 :     <li><A HREF="#backups">
11 :     Backing Up Your Data
12 :     </A>
13 :    
14 :     <li><A HREF="#copying">
15 :     Copying a Version of the SEED
16 :     </A>
17 :    
18 :     <li><A HREF="#multiple_copies">
19 :     Running Multiple Copies of the SEED
20 :     </A>
21 :    
22 :     <li><A HREF="#adding_genomes">
23 :     Adding a New Genome to an Existing SEED
24 :     </A>
25 :    
26 :     <li><A HREF="#sims">
27 :     Computing Similarities
28 :     </A>
29 :    
30 :     <li><A HREF="#deleting_genomes">
31 :     Deleting Genomes from a Version of the SEED
32 :     </A>
33 :    
34 :     <li><A HREF="#reintegrate_sims">
35 :     Periodic Reintegration of Similarities
36 :     </A>
37 :    
38 :     <li><A HREF="#pins_and_clusters">
39 :     Computing "Pins" and "Clusters"
40 :     </A>
41 :    
42 : gdpusch 1.13 <li><A HREF="#auto_annotation">
43 :     Automatic Annotation of Genomes
44 :     </A>
45 :    
46 : gdpusch 1.8 </ul>
47 :    
48 :    
49 :     <h2 id="backups">Backing Up Your Data</h2>
50 : olson 1.1 The data and code stored within the SEED are organized as follows:
51 :     <pre>
52 :     ~fig on a Mac: /Users/fig; on Linux: /home/fig
53 :     FIGdisk
54 :     dist source code
55 :     FIG
56 :     Tmp temporary files
57 :     Data data in readable form
58 :     </pre>
59 : olson 1.2 <ol><li>
60 : olson 1.1 The directory <b>FIGdisk</b> holds both the code and data for the
61 :     SEED. The data is loaded into a database system that stores the data
62 :     in a location external to FIGdisk, but otherwise a running SEED is
63 :     encapsulated within FIGdisk. A symbolic link to FIGdisk is maintained
64 :     in the directory ~fig.
65 :     <br>
66 :     <li>
    Within FIGdisk there are two key directories:
68 :     <br>
69 :     <br><ol><li>
70 :     <b>dist</b> contains the source code, and
71 :    
72 :     <li>
73 :     <b>FIG</b> contains the execution environment and Data.
74 :     </ol>
75 :     <br>
76 :     <li>
77 :     Within FIG, there are a number of directories. The most important are
78 :     <br>
79 :     <br>
80 :     <ol>
81 :     <li>
82 :     <b>Data</b>, which contains all of the data in a human-readable form,
83 :     and
84 :     <br>
85 :     <br>
86 :     <li>
87 :     <b>Tmp</b>, which contains the temporary files built by SEED in
88 :     response to commands.
89 :     </ol>
90 :     </ol>
91 :     <br>
    Hence, to back up your data, you should simply copy the Data
93 :     directory. It should be backed up to a separate disk. Suppose that
94 :     /Volumes/Backup is a backup disk. Then,
95 :     <br>
96 :     <pre>
cp -pRP ~/FIGdisk/FIG/Data /Volumes/Backup/Data.Backup
gzip -r /Volumes/Backup/Data.Backup
99 :     </pre>
100 :     <br>
101 :     would be a reasonable way to make a backup. The copy preserves
102 :     permissions, copies recursively, and does not follow symbolic links.
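<p>The backup described above can be wrapped in a small helper, sketched here under stated assumptions: the function name <tt>backup_seed_data</tt> and the dated <tt>Data.Backup.YYYY-MM-DD</tt> naming are illustrative and not part of the SEED. It refuses to overwrite an existing backup and tolerates the warning status that <tt>gzip</tt> returns when it skips entries such as symbolic links.</p>

```shell
# Hypothetical helper, not a SEED command: copy a Data directory to a dated
# backup directory and compress the copy.
backup_seed_data() {
    src=$1                           # e.g. ~/FIGdisk/FIG/Data
    dest_root=$2                     # e.g. /Volumes/Backup
    stamp=${3:-$(date +%Y-%m-%d)}    # optional date label (assumed naming)
    dest="$dest_root/Data.Backup.$stamp"

    if [ ! -d "$src" ]; then
        echo "no such directory: $src" >&2
        return 1
    fi
    if [ -e "$dest" ]; then
        echo "refusing to overwrite $dest" >&2
        return 1
    fi

    # -p preserves permissions, -R copies recursively,
    # -P does not follow symbolic links.
    cp -pRP "$src" "$dest" || return 1

    # gzip exits with 2 for warnings (for example, skipped symbolic links);
    # only a status of 1 is treated as a failure here.
    gzip_status=0
    gzip -r "$dest" || gzip_status=$?
    [ "$gzip_status" -ne 1 ] || return 1

    echo "$dest"
}
```

<p>For example, <tt>backup_seed_data ~/FIGdisk/FIG/Data /Volumes/Backup</tt> would print the path of the new backup directory on success.</p>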
103 :     <br>
104 : gdpusch 1.8 <h2 id="copying">Copying a Version of the SEED</h2>
105 : olson 1.1
106 :     To make a second copy of the SEED (either for a friend or for yourself), you should use tar
107 :     to preserve a few symbolic links (which are relative, not absolute; this means that they can
108 :     be copied while still preserving the integrity of the whole system).
109 :     So, suppose that you have a FIGdisk in /Volumes/From/FIGdisk.Jan8 and you wish to copy it
110 :     to /Volumes/To. Use
111 :     <pre>
112 :     cd /Volumes/From
113 :     tar cf - FIGdisk.Jan8 | (cd /Volumes/To; tar xf -)
114 :     </pre>
    <p>This should produce the desired copy. Suppose that we are in a Mac OS X
    environment, and that <b>From</b> and <b>To</b> are FireWire disks. To install the system on a friend's
    Mac, you would unmount <b>To</b>, plug it into the new machine, and then set the symbolic link to the active
    FIGdisk using
    </p>
122 :     <table border="1" bgcolor="#CCCCCC">
123 :     <tr>
124 :     <td width="403"><font face="Courier New, Courier, mono">cd ~fig</font></td>
125 :     <td width="285">&nbsp;</td>
126 :     </tr>
127 :     <tr>
128 :     <td><font face="Courier New, Courier, mono">rm FIGdisk</font></td>
129 :     <td># fails if there is no existing FIGdisk on the machine</td>
130 :     </tr>
131 :     <tr>
132 :     <td><font face="Courier New, Courier, mono">ln -s /Volumes/To/FIGdisk.Jan8 FIGdisk</font></td>
133 :     <td>&nbsp;</td>
134 :     </tr>
135 :     <tr>
136 : olson 1.2 <td><font face="Courier New, Courier, mono">bash</font></td>
    <td># Switch to using the bash shell</td>
138 :     </tr>
139 :     <tr>
140 : olson 1.1 <td><font face="Courier New, Courier, mono">cd FIGdisk</font></td>
141 :     <td>&nbsp;</td>
142 :     </tr>
143 :     <tr>
144 : olson 1.2 <td height="23"><font face="Courier New, Courier, mono">cp CURRENT_RELEASE DEFAULT_RELEASE</font></td>
145 : olson 1.1 <td># Causes the new configuration to use the code that was running in the
146 :     original installation</td>
147 :     </tr>
148 : olson 1.2 <tr>
149 :     <td height="23"><font face="Courier New, Courier, mono">./configure <em>arch-name</em></font></td>
150 :     <td># Configure the new SEED disk for architecture <em>arch-name</em>. </td>
151 :     </tr>
152 :     <tr>
153 :     <td height="23"><font face="Courier New, Courier, mono"> source config/fig-user-env.sh <br>
154 :     </font></td>
155 :     <td># Set up the environment for using the SEED</td>
156 :     </tr>
157 :     <tr>
158 :     <td height="23"><font face="Courier New, Courier, mono">start-servers <br>
159 :     </font></td>
160 :     <td># Start the database server and registration servers</td>
161 :     </tr>
162 :     <tr>
163 :     <td height="23"><font face="Courier New, Courier, mono">init_FIG <br>
164 :     </font></td>
165 :     <td># Initialize a new relational database</td>
166 :     </tr>
167 :     <tr>
168 :     <td height="23"><font face="Courier New, Courier, mono">fig load_all</font></td>
169 :     <td># Load the database from the SEED data files. This may take several hours</td>
170 :     </tr>
171 :     </table>
172 :     <p>At this point, the new SEED copy should be ready to use. You only need to
173 :     perform the configure, init_FIG, and fig load_all steps once after installing
174 :     a new copy of the SEED. After a reboot or other clean start of the computer,
175 :     you will only have to do these steps:</p>
176 : olson 1.3 <table border="1" bgcolor="#EEEEEE">
177 : olson 1.2 <tr>
178 :     <td width="403"><font face="Courier New, Courier, mono">cd ~fig/FIGdisk</font></td>
179 :     <td width="285">&nbsp;</td>
180 :     </tr>
181 :     <tr>
182 :     <td><font face="Courier New, Courier, mono">bash</font></td>
    <td># Switch to using the bash shell</td>
184 :     </tr>
185 :     <tr>
186 :     <td height="23"><font face="Courier New, Courier, mono"> source config/fig-user-env.sh <br>
187 :     </font></td>
188 :     <td># Set up the environment for using the SEED</td>
189 :     </tr>
190 :     <tr>
191 :     <td height="23"><font face="Courier New, Courier, mono">start-servers <br>
192 :     </font></td>
193 :     <td># Start the database server and registration servers</td>
194 :     </tr>
195 : olson 1.1 </table>
196 : olson 1.2 <p>Upon setting up a new computer for running SEED, you should read the full
197 :     documentation for SEED installation, as it has a number of platform-specific
198 :     modifications that need to be performed. This document can currently be found
199 :     at the following
200 :     location in the SEED Wiki: </p>
201 : olson 1.1 <blockquote>
202 :     <p><a href="http://www-unix.mcs.anl.gov/SEEDWiki/moin.cgi/SeedInstallationInstructions"> http://www-unix.mcs.anl.gov/SEEDWiki/moin.cgi/SeedInstallationInstructions</a></p>
203 :     </blockquote>
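<p>The tar-based copy described earlier in this section can be sketched as a function that also makes it easy to confirm that relative symbolic links survive the copy. The name <tt>copy_figdisk</tt> is an assumption made for illustration; only <tt>tar</tt> itself is used.</p>

```shell
# Hypothetical helper: copy a FIGdisk directory with tar so that relative
# symbolic links inside it are preserved as links.
copy_figdisk() {
    from_parent=$1   # directory containing the FIGdisk, e.g. /Volumes/From
    disk_name=$2     # e.g. FIGdisk.Jan8
    to_parent=$3     # e.g. /Volumes/To

    if [ ! -d "$from_parent/$disk_name" ]; then
        echo "missing $from_parent/$disk_name" >&2
        return 1
    fi
    if [ ! -d "$to_parent" ]; then
        echo "missing $to_parent" >&2
        return 1
    fi

    # Same pipeline as in the text: archive to stdout, unpack in the target.
    (cd "$from_parent" && tar cf - "$disk_name") | (cd "$to_parent" && tar xf -)
}
```

<p>After a copy, running <tt>readlink</tt> on a link inside the new FIGdisk should show the same relative target as in the original.</p>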
204 : gdpusch 1.8 <h2 id="multiple_copies">Running Multiple Copies of the SEED</h2>
205 : olson 1.1
206 :     For individual users that use the SEED to support comparative analysis, a single copy is completely
207 :     adequate. Adding genomes can usually be done without disrupting normal use, and a very occasional major
208 :     reorganization that runs over the weekend is not a big deal.
209 :     <p>
210 :     The situation is somewhat different when the system is being used to support a major sequencing/annotation
211 :     effort. In this case, you have a user community that is sensitive to disruptions of service, and you
212 :     have frequent demands to update versions of data. In this case, it is best to have two systems: the
213 :     <b>production system</b> is used to support the larger user community, and the <b>update system</b> is
214 : overbeek 1.7 used to prepare updated versions of the system.
215 :     New genomes are added to the update system, and then periodically a
216 :     revised Data directory is extracted to update the production system.
217 :     Even so, work stoppages of a few hours will occur when
218 :     new releases are swapped in.
219 :     <p>
220 :     This use of an "update" and a "production" system is quite analogous
221 :     to running a production system which is occasionally updated from new
222 :     Data DVDs (which FIG normally makes available about every 4-6 months).
223 :     That is, in both cases you are updating a production system from a
224 :     newly created <b>Data</b> directory that is lacking assignments and
225 :     annotations that exist on your production system. However, if you have
226 :     added new genomes to the production system (that are not part of the
227 :     releases you may acquire via DVDs), you should get the new release,
228 :     install the versions of your local genomes, and then do this update
229 :     procedure.
230 :     <p>
231 :     The plan we propose is to build a completely encapsulated new version
232 :     of the system, then capture updates from the old production system, update
233 :     the new production system, and then make the new version the actual
234 :     production system. This last step amounts to altering a symbolic link
235 :     to point at the new production system rather than the old. This has
236 :     the virtue of ease of recovery -- that is, if something goes wrong you
237 :     can flip back to the old system.
238 :     The actual steps are as follows:
239 : olson 1.1 <ol>
240 : gdpusch 1.9
    <li> First, make sure that you are in the BASH shell by typing "echo $SHELL";
    if the result does not end in "bash" (for example, "/bin/bash"), type "bash" to enter the BASH shell.
243 : gdpusch 1.10 <p>
244 : gdpusch 1.9
245 :     <li> Next, check that the result of typing "which perl" is the version
246 :     of perl owned by the SEED; it should look something like
247 :     <pre>
248 :     /Users/fig/FIGdisk/env/mac/bin/perl
249 :     </pre>
250 :     although the exact results will depend on where your existing copy
251 :     of the SEED is installed, whether your platform is a Macintosh or LINUX,
252 :     etc. If the result does not look similar to the above, type:
253 :     <pre>
254 :     source Path_to_FIGdisk/config/fig-user-env.sh
255 :     </pre>
256 :     to setup your FIG environment properly.
257 : gdpusch 1.10 <p>
258 : gdpusch 1.9
259 :     <li> Next, make a copy of the Code Distribution Environment (from a DVD
260 : overbeek 1.7 or via the network). Suppose that we have made such a directory in
261 :     CodeDistEnv. Then use,
262 :     <pre>
263 :     cd CodeDistEnv
264 :     ./install-code TargetDirectory
265 :     </pre>
266 :     where <b>TargetDirectory</b> is where you wish to build the new
267 :     production version. We recommend calling it something like
268 :     <b>FIGdisk.July24</b>.
269 : gdpusch 1.10 <p>
270 : overbeek 1.7
271 : gdpusch 1.6 <li> Stop all work on the production machine for the duration of the update.
272 :     You do this by clicking on the "Seed Control Panel" link,
273 :     and then entering an explanatory message in the text box
274 :     and clicking on the "Disable SEED server" button.
275 : gdpusch 1.10 <p>
276 : gdpusch 1.9
277 : gdpusch 1.6 <li> You now need to capture the assignments, annotations and
278 :     subsystems work that has been done on the production machine.
279 :     To do this, you need to know when the last production release
280 :     was installed. Suppose that it was July 1, 2004.
281 : gdpusch 1.9 If that was the date, we recommend that you run
282 : gdpusch 1.6 <pre>
283 :     <b>extract_data_for_syncing_after_update 7/1/2004 /tmp/sync.data.july.1.2004</b>
284 :     </pre>
285 : gdpusch 1.9
286 : gdpusch 1.6 This will capture your updates and save them in the directory
287 : gdpusch 1.9 /tmp/sync.data.july.1.2004.<br>
288 : gdpusch 1.10 <p>
289 : gdpusch 1.9
290 : overbeek 1.7 <li>Now, you need to stop the existing production system using
291 :     <pre>
292 :     ~/FIGdisk/bin/stop-servers
293 :     </pre>
294 : gdpusch 1.10 <p>
295 : overbeek 1.7
296 :     <li>Now, you need to configure the runtime environment for the system
297 :     you are running on.
298 :     To do this, use
299 :     <pre>
300 :     cd TargetDirectory
301 :     ./configure MacOrLinux
302 :     </pre>
    where <b>MacOrLinux</b> must be a currently supported environment.
    Those supported as of July 24, 2004 are <b>mac</b> for
    Macintoshes running Panther, <b>mac-jaguar</b> for those that have not
    upgraded to Panther, and <b>linux-postgres</b>.
307 : gdpusch 1.10 <p>
308 : overbeek 1.7
309 :     <li>Now, you need to insert the new Data directory into the newly
310 :     constructed version of the SEED. To do this use
311 :     <pre>
312 : gdpusch 1.9 chmod -R 777 TheNewData
313 : overbeek 1.7 cd TargetDirectory/FIG
314 :     ln -s TheNewData Data
315 :     </pre>
316 :     where TheNewData is the new Data directory, which normally comes from the
317 :     update system. If you acquired a new Data directory via Data DVDs, you
318 :     will need to unpack them using the README instructions, but what
319 :     results is a new version of the <b>Data</b> directory.
320 : gdpusch 1.10 <p>
321 : gdpusch 1.9
322 : overbeek 1.7 <li>Now, you need to start the servers in order to load the databases
323 :     with the new release using
324 :     <pre>
325 :     cd TargetDirectory/bin
326 :     ./start-servers
327 :     cd ..
328 :     source config/fig-user-env.sh
329 :     init_FIG
330 :     fig load_all
331 :     </pre>
332 :     This last command will run for several hours.
333 : gdpusch 1.10 <p>
334 : gdpusch 1.9
335 : gdpusch 1.11 (<b>WARNING:</b> Please note that, because the new SEED's databases
336 :     do not yet exist, the `init_FIG` command will generate two totally
337 :     harmless but rather terrrifying error messages the very first time it is executed,
338 :     so that its output will look something like this:
339 :    
340 :     <pre>
341 :     DBI connect('dbname=fig;port=10000','fig',...) failed: FATAL: Database "fig" does not exist in the system catalog. at /home2/FIGdisk.July22/dist/releases/snap-2004-0723/linux-postgres/lib/FigKernelPackages/DBrtns.pm line 21
342 :    
343 :     Initializing new SEED database fig
344 :    
345 :     ERROR: DROP DATABASE: database "fig" does not exist
346 :     dropdb: database removal failed
347 :     CREATE DATABASE
348 :     NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index 'file_table_pkey' for table 'file_table'
349 :     CREATE TABLE
350 :    
351 :     Complete. You will need to run "fig load_all" to load the data.
352 :     </pre>
    We recognize that generating the above two spurious errors
    (one "FATAL" and one "ERROR") is rather ugly and inelegant,
    but we have not yet found a database initialization method
    that avoids them.)
357 :     <p>
358 :    
359 : gdpusch 1.6 <li> Now, you need to capture the changes made to the old production
360 :     version using something like
361 :     <pre>
362 :     <b>sync_new_system /tmp/sync.data.july.1.2004 make-assignments</b>
363 :     </pre>
364 : gdpusch 1.10 <p>
365 :    
366 : overbeek 1.7 <li>Run
367 : gdpusch 1.10 <pre>
368 : overbeek 1.7 index_annotations
369 :     index_subsystems
370 :     make_indexes
371 : gdpusch 1.10 </pre>
372 :     <p>
373 : gdpusch 1.9
    <li> Finally, you should alter the FIGdisk symbolic link in <i>~fig</i> so that
    it points at the new SEED, using something like:
376 :     <pre>
377 :     cd ~fig
378 :     rm FIGdisk # should be removing a symbolic link to the current SEED
379 :     ln -s TargetDirectory FIGdisk
380 :     </pre>
381 :     That should make the new SEED the one available through the Web interface.
382 : gdpusch 1.10 <p>
383 : gdpusch 1.9
384 : gdpusch 1.6 <li> You should now bring your update system to the same state as the
385 :     production system. This can be done by making sure that
386 :     <b>/tmp/sync.data.july.1.2004</b> is accessible to the update system.
387 :     If the production and update systems are run on the same machine, then
388 :     the directory is already there. If not, copy it to <b>/tmp</b> on the
389 :     update machine. Then run
390 :     <br>
391 :     <pre>
392 :     <b>sync_new_system /tmp/sync.data.july.1.2004 make-assignments</b>
393 :     </pre>
394 :     <br>
395 :     on the update machine.
396 : olson 1.1 </ol>
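<p>The final switch, and the "flip back" recovery mentioned above, can be sketched as two small functions. This is only an illustration: the names <tt>switch_figdisk</tt> and <tt>rollback_figdisk</tt>, and the <tt>FIGdisk.previous</tt> file used to remember the old target, are assumptions and not part of the SEED.</p>

```shell
# Hypothetical helpers for repointing the FIGdisk symbolic link.
switch_figdisk() {
    fig_home=$1      # directory holding the FIGdisk link, e.g. ~fig
    new_target=$2    # e.g. the TargetDirectory built above

    link="$fig_home/FIGdisk"
    if [ ! -d "$new_target" ]; then
        echo "not a directory: $new_target" >&2
        return 1
    fi

    if [ -e "$link" ] || [ -L "$link" ]; then
        # Only ever remove a symbolic link, never a real directory.
        if [ ! -L "$link" ]; then
            echo "$link is not a symbolic link" >&2
            return 1
        fi
        # Remember the old target so the switch can be undone (assumed file).
        readlink "$link" > "$fig_home/FIGdisk.previous"
        rm "$link" || return 1
    fi
    ln -s "$new_target" "$link"
}

rollback_figdisk() {
    fig_home=$1
    if [ ! -s "$fig_home/FIGdisk.previous" ]; then
        echo "nothing to roll back to" >&2
        return 1
    fi
    old_target=$(cat "$fig_home/FIGdisk.previous")
    switch_figdisk "$fig_home" "$old_target"
}
```

<p>Recording the previous target before removing the link is a design choice made for this sketch; it keeps recovery to a single command if the new release misbehaves.</p>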
397 : overbeek 1.7 <p>
398 :    
399 : olson 1.1 Our experience is that anytime a group wishes to share a common production environment,
400 : overbeek 1.4 this 2-system approach is the way to do it. You can, if necessary,
401 :     put both systems on the same physical machine. This does require some
402 :     special handling in setting up two different <b>FIGdisk</b>
403 :     directories. We recommend using <b>FIGdisk.production</b> and
404 :     <b>FIGdisk.update</b>. However, in general it makes sense to use two
405 :     separate physical machines, for backup if nothing else. The update
406 :     system can usually be run on a $2000 (or less) box, although it is
407 :     desirable to spend a little more and get at least 1 gigabyte of main
408 :     memory and 200 gigabytes of external disk.
409 : olson 1.1 <br>
410 : gdpusch 1.8 <h2 id="adding_genomes">Adding a New Genome to an Existing SEED</h2>
    Adding a new genome to a running SEED is fairly easy, but a
    number of details must be handled with care.
413 :     <p>
414 :     The first thing to note is that the SEED does not include tools to call genes -- you are expected
415 :     to provide gene calls. This may change at some point, but for now you must call your own genes. A
416 :     number of good tools now exist in the public domain, and you will need to find one that seems adequate
417 :     for your needs.
418 :     <p>
    Let us now
    cover how to prepare the actual data. You need to construct a directory (somewhere like ~fig/Tmp)
    of the following form:
422 :     <br>
423 :     <table width="100%">
424 :     <tr>
425 :     <td><tt>GenomeId</tt></td>
426 :     <td></td>
427 :     <td></td>
428 :     <td></td>
429 :     <td>of the form xxxx.y where xxxx is the taxon ID and y is an integer</td>
430 :     </tr>
431 :    
432 :     <tr>
433 :     <td></td>
434 :     <td><tt>PROJECT</tt></td>
435 :     <td></td>
436 :     <td></td>
    <td>a file containing a description of the source of the data</td>
438 :     </tr>
439 :    
440 :     <tr>
441 :     <td></td>
442 :     <td><tt>GENOME</tt></td>
443 :     <td></td>
444 :     <td></td>
445 :     <td>a file containing a single line identifying the genus, species and strain</td>
446 :     </tr>
447 :    
448 :     <tr>
449 :     <td></td>
450 :     <td><tt>TAXONOMY</tt></td>
451 :     <td></td>
452 :     <td></td>
453 :     <td>a file containing a single line containing the NCBI taxonomy</td>
454 :     </tr>
455 :    
456 :     <tr>
457 :     <td></td>
458 :     <td><tt>RESTRICTIONS</tt></td>
459 :     <td></td>
460 :     <td></td>
461 :     <td>a file containing a description of distribution restrictions (optional)</td>
462 :     </tr>
463 :    
464 :     <tr>
465 :     <td></td>
466 :     <td><tt>CONTIGS</tt></td>
467 :     <td></td>
468 :     <td></td>
469 :     <td>contigs in fasta format</td>
470 :     </tr>
471 :    
472 :     <tr>
473 :     <td></td>
474 :     <td><tt>assigned_functions</tt></td>
475 :     <td></td>
476 :     <td></td>
477 :     <td>function assignments for the protein-encoding genes (optional)</td>
478 :     </tr>
479 :    
480 :     <tr>
481 :     <td></td>
482 :     <td><tt>Features</tt></td>
483 :     </tr>
484 :    
485 :     <tr>
486 :     <td></td>
487 :     <td></td>
488 :     <td><tt>peg</tt></td>
489 :     </tr>
490 :    
491 :     <tr>
492 :     <td></td>
493 :     <td></td>
494 :     <td></td>
495 :     <td><tt>tbl</tt></td>
    <td>describes locations and aliases for the protein-encoding genes</td>
498 :     </tr>
499 :    
500 :     <tr>
501 :     <td></td>
502 :     <td></td>
503 :     <td></td>
504 :     <td><tt>fasta</tt></td>
    <td>fasta file of translations of the protein-encoding genes</td>
507 :     </tr>
508 :    
509 :     <tr>
510 :     <td></td>
511 :     <td></td>
512 :     <td><tt>rna</tt></td>
513 :     </tr>
514 :    
515 :     <tr>
516 :     <td></td>
517 :     <td></td>
518 :     <td></td>
519 :     <td><tt>tbl</tt></td>
    <td>describes locations and aliases for the RNA-encoding genes</td>
522 :     </tr>
523 :    
524 :     <tr>
525 :     <td></td>
526 :     <td></td>
527 :     <td></td>
528 :     <td><tt>fasta</tt></td>
    <td>fasta file of the DNA corresponding to the genes</td>
531 :     </tr>
532 :    
533 :    
534 :     </table>
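<p>A quick structural check of such a directory can be sketched as follows. The function name <tt>check_genome_dir</tt> is hypothetical. The sketch assumes that the files not marked "(optional)" in the layout are required, and that a <tt>peg</tt> or <tt>rna</tt> subdirectory, when present, should contain both <tt>tbl</tt> and <tt>fasta</tt>.</p>

```shell
# Hypothetical helper: report files missing from a genome directory.
# Prints one line per problem and fails if any problem was found.
check_genome_dir() {
    genome_dir=$1
    problems=0

    # Files listed in the layout without an "(optional)" note (assumed required).
    for required in PROJECT GENOME TAXONOMY CONTIGS; do
        if [ ! -f "$genome_dir/$required" ]; then
            echo "missing required file: $required"
            problems=$((problems + 1))
        fi
    done

    # Assumption: each feature subdirectory that exists needs tbl and fasta.
    for kind in peg rna; do
        [ -d "$genome_dir/Features/$kind" ] || continue
        for part in tbl fasta; do
            if [ ! -f "$genome_dir/Features/$kind/$part" ]; then
                echo "missing Features/$kind/$part"
                problems=$((problems + 1))
            fi
        done
    done

    [ "$problems" -eq 0 ]
}
```

<p>This checks only that the files exist; it says nothing about whether their contents are well formed.</p>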
535 :    
566 :     <br>
567 :     <br>
568 :     Let us expand on this very brief description:
569 :     <ol>
570 :     <li>
571 :     The name of the directory must be of the form xxxx.y where xxxx is the
572 :     taxon ID, and y is a sequence number. For example, 562.1 might be
573 :     used for <i>E.coli</i>, since 562 is the NCBI taxon ID for
574 :     <i>Escherichia coli</i>. The sequence number (y) is used to
575 :     distinguish multiple genomes having the same taxon ID.
576 :     <br><br>
577 :     <li>
    The assigned_functions file contains assignments of function for the
    protein-encoding genes. Each line is of the form
580 :     <pre>
581 :     Id\tFunction\tConfidence (\t stands for a tab character)
582 :     </pre>
583 :     The Id must be a valid PEG Id. These are of the form:
584 :     <pre>
585 :     fig|xxxx.y.peg.z
586 :     </pre>
587 :     where xxxx.y is the genome Id, and z is an integer that uniquely distinguishes
588 :     the peg (protein-encoding gene).
589 :     <br>
590 :     <i>Confidence</i> is a single character code:
591 :     <br>
592 :     <ul>
593 :     <li>a space for "normal"
594 :     <li>w for "weak"
595 :     <li>e for experimentally verified
596 :     <li>s for "strong evidence (but not experimental)"
597 :     </ul>
598 :     The second tab and the confidence code can be omitted (it will default to a space).
599 :     The assigned_functions file is optional. You can leave it blank and, after adding the genome
600 :     to the SEED, ask for automated assignments.
601 :     <br><br>
602 :     <li>
    The tbl files specify the locations of genes, as well as any aliases. Each line in a tbl file
    is of the form
605 :     <br>
606 :     <pre>
607 :     Id\tLocation\tAliases (the aliases are separated by tabs)
608 :     </pre>
609 :     The Id must conform to the fig|xxxx.y.peg.z format described above. The <i>Location</i> is of the form
610 :     <br>
611 :     <pre>
612 :     L1,L2,L3...Ln
613 :    
614 :     where each Li describes a region on a contig and is of the form
615 :    
616 :     <i>Contig_Begin_End</i> where
617 :    
618 :     Contig is the Id of the contig,
619 :     Begin is the position of the first character, and
620 :     End is the position of the last character
621 :     </pre>
622 :     <ul>
    <li>if Begin &gt; End, the region being described is on the complementary strand, and
    <li>the End position is the last character preceding the stop codon (i.e., the region
    corresponding to a protein-encoding gene is thought of as including all bases from the
    first base of the start codon to the last base before the stop codon).
627 :     </ul>
628 :     For example,
629 :     <pre>
630 :     fig|562.1.peg.15 Escherichia_coli_K12_14168_15295 dnaJ b0015 sp|P08622 gi|16128009
631 :     </pre>
632 :     describes the <i>dnaJ</i> gene encoded on the positive strand from 14168 through 15295 on the contig Escherichia_coli_K12.
633 :     The gene is from the genome 562.1, and it has 4 specified aliases.
634 :     <li>
635 :     The fasta files must have gene Ids that match tbl file entries. The <i>peg</i> fasta file contains translations,
636 :     while the <i>rna</i> fasta file contains DNA sequences.
637 :     <li>
638 :     Both the <i>peg</i> and the <i>rna</i> subdirectories are optional.
639 :     </ol>
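<p>The Id and location formats described in the list above can be checked with a few small functions. This is a sketch only: the function names are hypothetical, and the patterns encode just what the text states (a genome Id of the form xxxx.y, a PEG Id of the form fig|xxxx.y.peg.z, and regions of the form Contig_Begin_End, where the contig Id may itself contain underscores).</p>

```shell
# Succeeds if $1 looks like a genome Id: taxon ID, a dot, a sequence number.
is_genome_id() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]+[.][0-9]+$'
}

# Succeeds if $1 looks like a PEG Id of the form fig|xxxx.y.peg.z.
is_peg_id() {
    printf '%s\n' "$1" | grep -Eq '^fig[|][0-9]+[.][0-9]+[.]peg[.][0-9]+$'
}

# Prints "contig begin end strand" for one region Contig_Begin_End.
# The split is done from the right because contig Ids may contain underscores.
describe_region() {
    region=$1
    end=${region##*_}
    rest=${region%_*}
    begin=${rest##*_}
    contig=${rest%_*}
    # With fewer than two underscores, contig ends up equal to rest.
    if [ -z "$contig" ] || [ "$contig" = "$rest" ]; then
        echo "bad region: $region" >&2
        return 1
    fi
    for number in "$begin" "$end"; do
        case $number in
            ''|*[!0-9]*) echo "bad region: $region" >&2; return 1 ;;
        esac
    done
    # Begin > End means the region is on the complementary strand.
    if [ "$begin" -gt "$end" ]; then strand=-; else strand=+; fi
    echo "$contig $begin $end $strand"
}

# Prints one line per region of a comma-separated location L1,L2,...,Ln.
describe_location() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r one_region; do
        describe_region "$one_region" || exit 1
    done
}
```

<p>With the example from the text, <tt>describe_region Escherichia_coli_K12_14168_15295</tt> prints <tt>Escherichia_coli_K12 14168 15295 +</tt>.</p>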
640 :     <br>
641 :     The SEED provides a utility that can be used to produce such a directory from a GenBank entry. Thus,
642 :     <br>
643 :     <pre>
    parse_genbank 562.4 ~/Tmp/562.4 &lt; genbank.entry.for.a.new.E.coli.genome
645 :     </pre>
646 :     would attempt to produce a properly formatted directory (~/Tmp/562.4) containing
647 :     the data encoded in the GenBank entry from the file <i>genbank.entry.for.a.new.E.coli.genome</i>.
    This script is far from perfect, and encodings in GenBank files vary
    enormously. So use it at your own risk, and check the output manually.
650 :     <p>
651 :     You would be well advised to look at some of the subdirectories included in the FIGdisk/FIG/Data/Organisms directory
652 :     to see examples of how it should be done.
653 :     <p>
654 :     So, supposing that you have built a valid directory (say, <i>/Users/fig/Tmp/562.4</i>), you can add the genome using
655 :     <pre>
656 :     fig add_genome /Users/fig/Tmp/562.4
657 :     </pre>
658 :     <br>
659 :     The <i>add_genome</i> request will add your new genome and queue a computational request that similarities
660 :     be computed for the protein-encoding genes.
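<p>Before running <tt>add_genome</tt>, an assigned_functions file can be checked against the line format described earlier in this section (Id, tab, Function, and an optional tab plus one-character confidence code). The following is a minimal sketch using <tt>awk</tt>; the function name <tt>check_assigned_functions</tt> is hypothetical, and the check stops at the first problem found on each line.</p>

```shell
# Hypothetical helper: validate the tab-separated lines of an
# assigned_functions file. Prints one message per bad line and fails
# if any line was rejected.
check_assigned_functions() {
    awk -F '\t' '
        # Field 1: PEG Id, field 2: function, optional field 3: confidence.
        NF < 2 || NF > 3 {
            printf "line %d: expected 2 or 3 tab-separated fields\n", NR
            bad = 1
            next
        }
        $1 !~ /^fig[|][0-9]+[.][0-9]+[.]peg[.][0-9]+$/ {
            printf "line %d: bad PEG Id %s\n", NR, $1
            bad = 1
            next
        }
        # Confidence codes from the text: space, w, e, or s.
        NF == 3 && $3 !~ /^[ wes]$/ {
            printf "line %d: bad confidence code\n", NR
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

<p>Whether the Ids actually belong to the genome being added is not checked here; that would require comparing them with the tbl file.</p>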
661 :    
662 : gdpusch 1.8 <h2 id="sims">Computing Similarities</h2>
663 : olson 1.1
664 :     Adding a genome does not automatically get similarities computed for the new genome; it queues the request.
665 :     To get the similarities actually computed, you need to establish a computational environment on which
666 :     the blast runs will be made, and then initiate a request on the machine running the SEED.
667 :     <p>
668 :     This is not a completely trivial process because there are a variety of different ways to compute
669 :     similarities:
670 :     <ol>
671 :     <li> You can just compute them on the system running the SEED. This can take several days, but this
672 :     is often a perfectly reasonable way to get the job done.
673 :     <li>Alternatively, you may be in an environment where you have a set of networked machines (say, 4-5 machines),
674 :     and you wish to just exploit these machines to do the blast runs.
675 :     <li> Finally, you may be dealing with a large genome or genomes (and, hence, the need for many days of computation).
676 :     In this case, it makes sense to utilize a large computational resource, and this resource may either
677 :     be a local cluster or a service provided over the net.
678 :     </ol>
679 :     <br>
680 :     To establish the flexibility needed to support all of these alternatives, we implemented the following
681 :     approach:
682 :     <ul>
683 :     <li>
684 :     The user can describe one or more <b>similarity computational environments</b>
685 :     in a configuration file called <i>similarities.config</i>. The details of this encoding
686 :     are beyond the scope of this document.
687 :     These environments all represent potential ways to compute similarities.
688 :     <br>
689 :     <li>
    When a SEED systems administrator (usually the normal SEED user) wishes to run similarities,
    they use the <b>generate_similarities</b> command with two parameters. The first names a
    similarity computational environment; all of the queued similarity requests are batched up
    and sent off to the server it specifies (which may simply be the same machine). The second
    specifies whether automated assignments should be computed as the similarity computations
    complete. As the similarities complete, they are installed automatically. Further, if
    similarities arrive for a protein-encoding gene that has no current assignment of function,
    an automated assignment may be computed, provided the second parameter requested it. For example,
700 :     <pre>
701 :     generate_similarities local auto-assignments
702 :     </pre>
703 :     specifies a similarity computational environment labeled <i>local</i>, which presumably means "run the blast
704 :     requests on this machine", and requests automated assignments for all protein-encoding genes that currently either
705 :     have no assigned function or have an assigned function that is "hypothetical".
706 :     </ul>
707 :     <br>
708 :    
709 :     We anticipate that at least one major center (Argonne National Lab) and, perhaps, more will create well-defined
710 :     interfaces for handling high-volume requests. At FIG, we will maintain a set of instructions on how to set up
711 :     your configuration to exploit these resources.
712 : overbeek 1.12 <p>
713 :     No matter how you produce the new similarities, they need to be added
714 :     as a file in the <b>FIGdisk/FIG/Data/NewSims</b> directory. Then, you
715 :     need to index these similarities using
716 :     <pre>
717 :     index_sims ~/FIGdisk/FIG/Data/NewSims/XXXX
718 :     </pre>
719 :     where XXXX is the file you added. If you have more than one such
720 :     file, just give several arguments to the command. This
721 :     "indexes" the similarities, so that similarities connecting any of the
722 :     new PEGs to PEGs from the existing genomes
723 :     can now be displayed. However, the connections from the existing
724 :     genomes back to the new PEGs do not yet exist (we call these the "flips"
725 :     of the computed sims). To get them, you need to go through a
726 :     process that will make your system unavailable for a period (and that
727 :     will put a substantial load on your system for a day or so, while
728 :     the SEED sorts, sifts, inserts, and generally plays with the "flips").
729 :     <br>
730 :     The extra steps you need to take to make a fully functional version
731 :     are as follows:
732 :     <ol>
733 :     <li>
734 :     First, you need to run
735 :     <pre>
736 :     update_sims ~/FIGdisk/FIG/Data/Global/peg.synonyms 300 ~/FIGdisk/FIG/Data/Sims ~/Tmp/FlippedSims ~/FIGdisk/FIG/Data/NewSims/*
737 :     </pre>
738 :     This should produce updated similarity files in a VERY BIG directory
739 : overbeek 1.14 that we happened to put at <i>~/Tmp/FlippedSims</i> (but which you could
740 : overbeek 1.12 put anywhere). This may run for as much as a day or so (and you can watch
741 :     its progress as it updates the similarity files).
742 :     <li>The next step is to replace the existing similarity files with the
743 :     newly computed ones. First, you need to make the SEED unavailable (via the
744 :     <b>SEED Control Panel</b>).
745 :     <li>Then, blow away the existing similarities using something like
746 :     <pre>
747 :     rm ~/FIGdisk/FIG/Data/Sims/*
748 :     rm ~/FIGdisk/FIG/Data/NewSims/*
749 :     cp ~/Tmp/FlippedSims/* ~/FIGdisk/FIG/Data/Sims
750 :     rm -r ~/Tmp/FlippedSims
751 :     </pre>
752 :     There are several ways to do this. You might want to save the old
753 :     similarities somewhere. You might be able to move (rather than copy)
754 :     the similarities. Whatever suits you.
755 :     <li> Then run
756 :     <pre>
757 :     index_sims
758 :     </pre>
759 :     to re-index all of the similarities, and you should be fully
760 :     operational.
761 :     </ol>
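<p>
The replacement step above can also be done in a way that keeps the old similarity files. The following is a minimal sketch, not part of the SEED distribution: the function name <b>swap_sims</b> and its three arguments are hypothetical, and it assumes the SEED has already been made unavailable. It moves the old files aside rather than deleting them, so the newly created Sims directory does not inherit the old directory's permissions. You would still need to clear <b>NewSims</b> and run <b>index_sims</b> afterwards, as described above.

```shell
#!/bin/sh
# Sketch: replace the live similarity files with the flipped ones,
# keeping the old files in a backup directory instead of deleting them.
# swap_sims and its arguments are hypothetical names used for illustration.
swap_sims() {
    sims_dir=$1      # e.g. ~/FIGdisk/FIG/Data/Sims
    flipped_dir=$2   # e.g. ~/Tmp/FlippedSims (the output of update_sims)
    backup_dir=$3    # where the old files are kept; must not exist yet

    # Refuse to proceed unless the flipped directory actually holds files.
    if [ ! -d "$flipped_dir" ] || [ -z "$(ls -A "$flipped_dir")" ]; then
        echo "swap_sims: no flipped similarities in $flipped_dir" >&2
        return 1
    fi
    # Never overwrite an earlier backup.
    if [ -e "$backup_dir" ]; then
        echo "swap_sims: backup directory $backup_dir already exists" >&2
        return 1
    fi

    # Move the old files aside, then move the new ones into place.
    mv "$sims_dir" "$backup_dir" || return 1
    mkdir "$sims_dir" || return 1
    mv "$flipped_dir"/* "$sims_dir"/ || return 1
    rmdir "$flipped_dir"
}
```

For example, <b>swap_sims ~/FIGdisk/FIG/Data/Sims ~/Tmp/FlippedSims ~/Tmp/OldSims</b> would leave the previous similarities in <i>~/Tmp/OldSims</i> until you are satisfied with the new ones.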
762 :     <br>
763 : olson 1.1
764 : gdpusch 1.8 <h2 id="deleting_genomes">Deleting Genomes from a Version of the SEED</h2>
765 : olson 1.1
766 :     There are two common instances in which one wishes to delete genomes from a running version of the SEED: one is
767 :     when you wish to replace an existing version of a genome (in which case the replacement is viewed as first
768 :     deleting the existing copy and then adding the new copy), and the second is when you wish to create a copy
769 :     of the SEED containing a subset of the entire collection of genomes.
770 :     <p>
771 :     To delete a set of genomes from a running version of the SEED, just use
772 :     <pre>
773 :     fig delete_genomes G1 G2 ... Gn (where G1 G2 ... Gn designates a list of genomes)
774 :     </pre>
775 :     For example,
776 :     <pre>
777 :     fig delete_genomes 562.1
778 :     </pre>
779 :     could be used to delete a single genome with a genome ID of 562.1.
780 :     <p>
781 :     Making a copy with some genomes deleted, to give to someone else, requires a slightly different approach.
782 :     To extract a set of genomes from an existing version of the SEED, you need to run the command
783 :     <pre>
784 :     extract_genomes Which ExistingData ExtractedData
785 :     </pre>
786 :    
787 :     The first argument is either the word "unrestricted" or the name of a file containing a list of
788 :     genome IDs (the genomes that are to be retained in the extraction). The second argument is
789 :     the path to the current Data directory. The third argument specifies the name of a directory
790 :     that is created to hold the extraction. Thus,
791 :     <pre>
792 : overbeek 1.12 extract_genomes unrestricted ~/FIGdisk/FIG/Data /Volumes/Tmp/ExtractedData
793 : olson 1.1 </pre>
794 :     would create the extracted Data directory for you. If you wish to then produce a fully distributable
795 :     version of the SEED from the existing version and the extracted Data directory, you would
796 :     use
797 :     <pre>
798 : overbeek 1.12 make_a_SEED ~/FIGdisk /Volumes/Tmp/ExtractedData /Volumes/MyFriend/FIGdisk.ReadyToGo
799 : olson 1.1 rm -rf /Volumes/Tmp/ExtractedData
800 :     </pre>
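<p>
When the first argument of <b>extract_genomes</b> is a file of genome IDs rather than "unrestricted", that file has to be prepared beforehand. The sketch below is one way to do so. It rests on two assumptions that this tutorial does not state: that the file holds one genome ID per line, and that every ID has the form <i>number.number</i> (as in 562.1). The function name <b>write_genome_list</b> is hypothetical.

```shell
#!/bin/sh
# Sketch: write a list of genome IDs to a file for use with extract_genomes.
# ASSUMPTIONS: one ID per line, and IDs of the form digits.digits (e.g. 562.1).
# write_genome_list is a hypothetical helper, not a SEED command.
write_genome_list() {
    out=$1
    shift
    if [ $# -eq 0 ]; then
        echo "usage: write_genome_list file genome_id..." >&2
        return 1
    fi

    # Validate every ID before touching the output file.
    for id in "$@"; do
        case $id in
            *[!0-9.]* | *.*.* | .* | *.) ok=no ;;   # stray characters or misplaced dots
            *.*) ok=yes ;;                          # exactly one dot between digits
            *) ok=no ;;                             # no dot at all (or empty)
        esac
        if [ "$ok" = no ]; then
            echo "write_genome_list: bad genome ID '$id'" >&2
            return 1
        fi
    done

    printf '%s\n' "$@" > "$out"
}
```

After <b>write_genome_list ~/Tmp/genomes.keep 562.1 562.4</b>, the file <i>~/Tmp/genomes.keep</i> could be given to <b>extract_genomes</b> in place of "unrestricted".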
801 :    
802 : gdpusch 1.8 <h2 id="reintegrate_sims">Periodic Reintegration of Similarities</h2>
803 : olson 1.1
804 :     When the initial SEED was constructed, similarities were computed. For most similarities of the form
805 :     "Id1 and Id2 are similar", entries were "recorded" for both Id1 and Id2. This is not always true,
806 :     since we truncate the number of similarities associated with any single Id (leaving us in a situation
807 :     in which we may have a similarity recorded for Id1, but not Id2). When a genome is added, if Id1 was an added
808 :     protein-encoding gene (peg), then the similarity is "recorded" for Id1 but not Id2. This means that when looking
809 :     at genes from previously existing organisms, you never get links back to the added pegs. This is not totally
810 :     satisfactory.
811 :     <p>
812 :     Periodically, it is probably a good idea to "reintegrate the similarities". This can be done by
813 :     just running
814 :     <pre>
815 :     reintegrate_sims
816 :     # update_sims /dev/null /dev/null ~/FIGdisk/FIG/Data/NewSims/* ; rm -f ~/FIGdisk/FIG/Data/NewSims/* index_sims
817 :     </pre>
818 :     The job will probably run for quite a while (perhaps as much as a day or two).
819 :    
820 : gdpusch 1.8 <h2 id="pins_and_clusters">Computing "Pins" and "Clusters"</h2>
821 : olson 1.1
822 :     The SEED displays potentially significant clusters on prokaryotic chromosomes. In the
823 :     process of finding preserved contiguity, it computes "pins", each of which is simply a set of genes
824 :     that are believed to be orthologs that cluster with similar genes. If you add your own genome,
825 :     you will probably want to compute and enter these into the active database. This can be done
826 :     using
827 :     <pre>
828 :     compute_pins_and_clusters G1 G2 G3 ...
829 :     </pre>
830 :     where the arguments are genome IDs. Thus,
831 :     <pre>
832 :     compute_pins_and_clusters 562.4
833 :     </pre>
834 :     would compute and add entries for all of the <i>pegs</i> in genome 562.4.
835 : gdpusch 1.13
836 :     <h2 id="auto_annotation">
837 :     Automatic Annotation of Genomes
838 :     </h2>
839 :     The SEED provides a simple but limited capability for automated assignment
840 :     of protein-encoding gene function based on similarity.
841 :     Candidate functions are assigned scores based on the combined strengths
842 :     of all BLASTP similarities to genes carrying that particular assignment,
843 :     weighted by the provenance and assignment-confidence for each similar gene.
844 :     The final automated function assignment is then determined from the
845 :     list of candidate functions and their associated scores.
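<p>
To make the scoring idea concrete, here is a deliberately simplified sketch. It is not the SEED's actual algorithm: the input format (tab-separated function, similarity strength, and weight) and the name <b>best_function</b> are assumptions made for illustration, and the single weight column stands in for the provenance and assignment-confidence weighting described above. It sums strength times weight for each candidate function and reports the function with the largest total; ties are resolved arbitrarily.

```shell
#!/bin/sh
# Sketch: choose the candidate function with the largest weighted score.
# Input on stdin (hypothetical format), one similarity per line:
#   function <TAB> similarity strength <TAB> weight
# Output: the winning function and its total, separated by a tab.
best_function() {
    awk -F '\t' '
        # Accumulate strength * weight for each candidate function.
        { score[$1] += $2 * $3 }
        END {
            best = ""
            for (f in score)
                if (best == "" || score[f] > score[best])
                    best = f
            if (best != "")
                printf "%s\t%g\n", best, score[best]
        }'
}
```

With two hits to "DNA polymerase I" (strengths 200 and 100, weights 1 and 0.5) and one hit to "Hypothetical protein" (strength 250, weight 0.5), the totals are 250 and 125, so "DNA polymerase I" would be chosen even though the single strongest hit carries the other function.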
846 :    
847 :     Automated assignment is a four-step process:
848 :     <ol>
849 :     <li> Create a list of PEGs to be automatically assigned.
850 :     If one wishes to make assignments to an entire organism or set of organisms
851 :     that are already installed in the SEED, the simplest method for creating
852 :     this list is to type the following command:
853 :     <pre>
854 :     pegs Genome1 Genome2 Genome3 ... > ~/Tmp/peg.list
855 :     </pre>
856 :    
857 :     <p>
858 :     <li> Next, create a list of candidate function-assignments using the following
859 :     command:
860 :     <pre>
861 :     auto_assign < ~/Tmp/peg.list > ~/Tmp/candidate.funcs
862 :     </pre>
863 :     (NOTE: The <b>auto_assign</b> command has some additional optional parameters;
864 :     for example, if one knows that all the PEGs in 'peg.list' are from
865 :     prokaryotic organisms, one can make use of this additional information
866 :     by invoking <b>auto_assign</b> as follows:
867 :     <pre>
868 :     auto_assign prokaryote < ~/Tmp/peg.list > ~/Tmp/candidate.funcs
869 :     </pre>
870 :     Also, if one wishes to use an alternate file of similarity data named 'simfile'
871 :     instead of the precomputed similarities stored in the SEED, one can instead type:
872 :     <pre>
873 :     auto_assign sims=simfile < ~/Tmp/peg.list > ~/Tmp/candidate.funcs
874 :     </pre>
875 :     Finally, <b>auto_assign</b> can read a set of alternate parameters from a file,
876 :     but we recommend that you stick with the default settings, and not exploit this
877 :     last feature unless you are a qualified SEED wizard.)
878 :     <p>
879 :    
880 :     <li> Next, create a SEED format assigned-functions file as follows:
881 :     <pre>
882 :     make_calls < ~/Tmp/candidate.funcs > ~/Tmp/assigned_functions
883 :     </pre>
884 :     Alternatively, if you wish to suppress the class of "non-informative" function assignments
885 :     such as "Hypothetical protein," "Unclassified protein," "predicted gene," etc.,
886 :     you may do so using the '-no_hypos' flag:
887 :     <pre>
888 :     make_calls -no_hypos < ~/Tmp/candidate.funcs > ~/Tmp/assigned_functions
889 :     </pre>
890 :    
891 :     <li> Finally, install the automated assignments in the SEED using the command
892 :     <pre>
893 :     fig auto_assignF ~/Tmp/assigned_functions
894 :     </pre>
895 :    
896 :     </ol>
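<p>
The four steps above can be chained in a small wrapper. The following is a sketch under stated assumptions: the function name <b>auto_annotate</b> and its working-directory argument are hypothetical, the default parameters of each command are used, and the SEED commands are assumed to return a non-zero exit status on failure (which this tutorial does not confirm).

```shell
#!/bin/sh
# Sketch: run the four automated-assignment steps for a set of genomes.
# auto_annotate and work_dir are hypothetical names; pegs, auto_assign,
# make_calls, and fig are the SEED commands described in the steps above.
auto_annotate() {
    work_dir=$1
    shift
    if [ $# -eq 0 ]; then
        echo "usage: auto_annotate work_dir genome_id..." >&2
        return 1
    fi
    mkdir -p "$work_dir" || return 1

    # Step 1: list the PEGs of the requested genomes.
    pegs "$@" > "$work_dir/peg.list" || return 1
    # Step 2: compute candidate function assignments.
    auto_assign < "$work_dir/peg.list" > "$work_dir/candidate.funcs" || return 1
    # Step 3: turn the candidates into a SEED assigned-functions file.
    make_calls < "$work_dir/candidate.funcs" > "$work_dir/assigned_functions" || return 1

    # Do not install anything if no assignments were produced.
    if [ ! -s "$work_dir/assigned_functions" ]; then
        echo "auto_annotate: no assignments produced" >&2
        return 1
    fi
    # Step 4: install the assignments.
    fig auto_assignF "$work_dir/assigned_functions"
}
```

For example, <b>auto_annotate ~/Tmp/auto 562.4</b> would leave the intermediate files in <i>~/Tmp/auto</i>, where they can be inspected if a step fails.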
897 :    
898 :     It should be noted once again that the SEED's automated assignment algorithm
899 :     is quite simple and crude, being only slightly better than simply assigning
900 :     the function of the highest-scoring BLASTP hit; however, it at least provides
901 :     a "quick and dirty" starting point for making an initial assessment of a genome,
902 :     which may then be cleaned up and refined by skilled genome annotators.
903 :    
904 :    
905 :    
906 :    
907 :    
908 :    
