<!-- Annotation of /FigTutorial/SEED_administration_issues.html, revision 1.18 -->
<h1>SEED Administration</h1>

3 :     <p>
4 :     This tutorial discusses a number of issues that you will need to know about
5 :     in order to install, share, and maintain your SEED installation.
6 :     It is organized as follows:
7 :     </p>
8 :    
9 :     <ul>
10 :     <li><A HREF="#backups">
11 :     Backing Up Your Data
12 :     </A>
13 :    
14 :     <li><A HREF="#copying">
15 :     Copying a Version of the SEED
16 :     </A>
17 :    
18 :     <li><A HREF="#multiple_copies">
19 :     Running Multiple Copies of the SEED
20 :     </A>
21 :    
22 :     <li><A HREF="#adding_genomes">
23 :     Adding a New Genome to an Existing SEED
24 :     </A>
25 :    
26 : overbeek 1.17 <li><A HREF="#importing_external">
27 :     Importing External Protein Data
28 :     </A>
29 :    
30 : gdpusch 1.8 <li><A HREF="#sims">
31 :     Computing Similarities
32 :     </A>
33 :    
34 :     <li><A HREF="#deleting_genomes">
35 :     Deleting Genomes from a Version of the SEED
36 :     </A>
37 :    
38 :     <li><A HREF="#reintegrate_sims">
39 :     Periodic Reintegration of Similarities
40 :     </A>
41 :    
42 :     <li><A HREF="#pins_and_clusters">
43 :     Computing "Pins" and "Clusters"
44 :     </A>
45 :    
46 : gdpusch 1.13 <li><A HREF="#auto_annotation">
47 :     Automatic Annotation of Genomes
48 :     </A>
49 :    
50 : gdpusch 1.8 </ul>
51 :    
52 :    
53 :     <h2 id="backups">Backing Up Your Data</h2>
The data and code stored within the SEED are organized as follows:
<pre>
~fig                     on a Mac: /Users/fig; on Linux: /home/fig
    FIGdisk
        dist             source code
        FIG
            Tmp          temporary files
            Data         data in readable form
</pre>
63 : olson 1.2 <ol><li>
64 : olson 1.1 The directory <b>FIGdisk</b> holds both the code and data for the
65 :     SEED. The data is loaded into a database system that stores the data
66 :     in a location external to FIGdisk, but otherwise a running SEED is
67 :     encapsulated within FIGdisk. A symbolic link to FIGdisk is maintained
68 :     in the directory ~fig.
69 :     <br>
70 :     <li>
Within FIGdisk there are two key directories:
72 :     <br>
73 :     <br><ol><li>
74 :     <b>dist</b> contains the source code, and
75 :    
76 :     <li>
77 :     <b>FIG</b> contains the execution environment and Data.
78 :     </ol>
79 :     <br>
80 :     <li>
81 :     Within FIG, there are a number of directories. The most important are
82 :     <br>
83 :     <br>
84 :     <ol>
85 :     <li>
86 :     <b>Data</b>, which contains all of the data in a human-readable form,
87 :     and
88 :     <br>
89 :     <br>
90 :     <li>
91 :     <b>Tmp</b>, which contains the temporary files built by SEED in
92 :     response to commands.
93 :     </ol>
94 :     </ol>
95 :     <br>
Hence, to back up your data, you should simply copy the Data
directory to a separate disk. Suppose that
/Volumes/Backup is a backup disk. Then,
<br>
<pre>
cp -pRP ~/FIGdisk/FIG/Data /Volumes/Backup/Data.Backup
gzip -r /Volumes/Backup/Data.Backup
</pre>
<br>
would be a reasonable way to make a backup. The copy preserves
permissions, copies recursively, and does not follow symbolic links;
the <tt>gzip -r</tt> step then compresses each file in the backup copy.
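<p>For sites that prefer to script the backup, the copy step above can be sketched in Python.
This is a minimal sketch, not part of the SEED distribution: the function name <tt>backup_data</tt>
is hypothetical, and it covers only the copy (it does not compress the result).</p>

```python
import shutil
from pathlib import Path


def backup_data(data_dir, backup_dir):
    """Copy data_dir to backup_dir, keeping symbolic links as links.

    Hypothetical helper for illustration; it is not a SEED command.
    """
    src = Path(data_dir)
    dst = Path(backup_dir)
    if dst.exists():
        # Refuse to overwrite an earlier backup by accident.
        raise FileExistsError(f"{dst} already exists; choose a fresh backup location")
    # symlinks=True recreates links instead of copying what they point to.
    # The default copy function (shutil.copy2) also tries to preserve
    # permission bits and timestamps.
    shutil.copytree(src, dst, symlinks=True)
    return dst
```

<p>A call such as <tt>backup_data("/Users/fig/FIGdisk/FIG/Data", "/Volumes/Backup/Data.Backup")</tt>
would mirror the <tt>cp -pRP</tt> command shown above; the paths are the ones used in this tutorial's example.</p>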
107 :     <br>
108 : gdpusch 1.8 <h2 id="copying">Copying a Version of the SEED</h2>
109 : olson 1.1
To make a second copy of the SEED (either for a friend or for yourself), you should use tar,
which preserves the few symbolic links in the tree. These links are relative, not absolute, so they can
be copied without breaking the integrity of the whole system.
Suppose that you have a FIGdisk in /Volumes/From/FIGdisk.Jan8 and you wish to copy it
to /Volumes/To. Use
<pre>
cd /Volumes/From
tar cf - FIGdisk.Jan8 | (cd /Volumes/To; tar xf -)
</pre>
<p>This should produce the desired copy. Suppose further that we are in a
Mac OS X
environment, and that <b>From</b> and <b>To</b> are FireWire disks. To install the system on a friend's
Mac, you would unmount <b>To</b>, plug it into the new machine, and then set the symbolic link to the active
FIGdisk using
<br>
</p>
126 :     <table border="1" bgcolor="#CCCCCC">
127 :     <tr>
128 :     <td width="403"><font face="Courier New, Courier, mono">cd ~fig</font></td>
129 :     <td width="285">&nbsp;</td>
130 :     </tr>
131 :     <tr>
132 :     <td><font face="Courier New, Courier, mono">rm FIGdisk</font></td>
133 :     <td># fails if there is no existing FIGdisk on the machine</td>
134 :     </tr>
135 :     <tr>
136 :     <td><font face="Courier New, Courier, mono">ln -s /Volumes/To/FIGdisk.Jan8 FIGdisk</font></td>
137 :     <td>&nbsp;</td>
138 :     </tr>
139 :     <tr>
140 : olson 1.2 <td><font face="Courier New, Courier, mono">bash</font></td>
141 :     <td>Switch to using the bash shell</td>
142 :     </tr>
143 :     <tr>
144 : olson 1.1 <td><font face="Courier New, Courier, mono">cd FIGdisk</font></td>
145 :     <td>&nbsp;</td>
146 :     </tr>
147 :     <tr>
148 : olson 1.2 <td height="23"><font face="Courier New, Courier, mono">cp CURRENT_RELEASE DEFAULT_RELEASE</font></td>
149 : olson 1.1 <td># Causes the new configuration to use the code that was running in the
150 :     original installation</td>
151 :     </tr>
152 : olson 1.2 <tr>
153 :     <td height="23"><font face="Courier New, Courier, mono">./configure <em>arch-name</em></font></td>
154 :     <td># Configure the new SEED disk for architecture <em>arch-name</em>. </td>
155 :     </tr>
156 :     <tr>
157 :     <td height="23"><font face="Courier New, Courier, mono"> source config/fig-user-env.sh <br>
158 :     </font></td>
159 :     <td># Set up the environment for using the SEED</td>
160 :     </tr>
161 :     <tr>
162 :     <td height="23"><font face="Courier New, Courier, mono">start-servers <br>
163 :     </font></td>
164 :     <td># Start the database server and registration servers</td>
165 :     </tr>
166 :     <tr>
167 :     <td height="23"><font face="Courier New, Courier, mono">init_FIG <br>
168 :     </font></td>
169 :     <td># Initialize a new relational database</td>
170 :     </tr>
171 :     <tr>
172 :     <td height="23"><font face="Courier New, Courier, mono">fig load_all</font></td>
173 :     <td># Load the database from the SEED data files. This may take several hours</td>
174 :     </tr>
175 :     </table>
176 :     <p>At this point, the new SEED copy should be ready to use. You only need to
177 :     perform the configure, init_FIG, and fig load_all steps once after installing
178 :     a new copy of the SEED. After a reboot or other clean start of the computer,
179 :     you will only have to do these steps:</p>
180 : olson 1.3 <table border="1" bgcolor="#EEEEEE">
181 : olson 1.2 <tr>
182 :     <td width="403"><font face="Courier New, Courier, mono">cd ~fig/FIGdisk</font></td>
183 :     <td width="285">&nbsp;</td>
184 :     </tr>
185 :     <tr>
186 :     <td><font face="Courier New, Courier, mono">bash</font></td>
187 :     <td>Switch to using the bash shell</td>
188 :     </tr>
189 :     <tr>
190 :     <td height="23"><font face="Courier New, Courier, mono"> source config/fig-user-env.sh <br>
191 :     </font></td>
192 :     <td># Set up the environment for using the SEED</td>
193 :     </tr>
194 :     <tr>
195 :     <td height="23"><font face="Courier New, Courier, mono">start-servers <br>
196 :     </font></td>
197 :     <td># Start the database server and registration servers</td>
198 :     </tr>
199 : olson 1.1 </table>
200 : olson 1.2 <p>Upon setting up a new computer for running SEED, you should read the full
201 :     documentation for SEED installation, as it has a number of platform-specific
202 :     modifications that need to be performed. This document can currently be found
203 :     at the following
204 :     location in the SEED Wiki: </p>
205 : olson 1.1 <blockquote>
206 :     <p><a href="http://www-unix.mcs.anl.gov/SEEDWiki/moin.cgi/SeedInstallationInstructions"> http://www-unix.mcs.anl.gov/SEEDWiki/moin.cgi/SeedInstallationInstructions</a></p>
207 :     </blockquote>
208 : gdpusch 1.8 <h2 id="multiple_copies">Running Multiple Copies of the SEED</h2>
209 : olson 1.1
210 :     For individual users that use the SEED to support comparative analysis, a single copy is completely
211 :     adequate. Adding genomes can usually be done without disrupting normal use, and a very occasional major
212 :     reorganization that runs over the weekend is not a big deal.
213 :     <p>
214 :     The situation is somewhat different when the system is being used to support a major sequencing/annotation
215 :     effort. In this case, you have a user community that is sensitive to disruptions of service, and you
216 :     have frequent demands to update versions of data. In this case, it is best to have two systems: the
217 :     <b>production system</b> is used to support the larger user community, and the <b>update system</b> is
218 : overbeek 1.7 used to prepare updated versions of the system.
219 :     New genomes are added to the update system, and then periodically a
220 :     revised Data directory is extracted to update the production system.
221 :     Even so, work stoppages of a few hours will occur when
222 :     new releases are swapped in.
223 :     <p>
224 :     This use of an "update" and a "production" system is quite analogous
225 :     to running a production system which is occasionally updated from new
226 :     Data DVDs (which FIG normally makes available about every 4-6 months).
227 :     That is, in both cases you are updating a production system from a
228 :     newly created <b>Data</b> directory that is lacking assignments and
229 :     annotations that exist on your production system. However, if you have
230 :     added new genomes to the production system (that are not part of the
231 :     releases you may acquire via DVDs), you should get the new release,
232 :     install the versions of your local genomes, and then do this update
233 :     procedure.
234 :     <p>
235 :     The plan we propose is to build a completely encapsulated new version
236 :     of the system, then capture updates from the old production system, update
237 :     the new production system, and then make the new version the actual
238 :     production system. This last step amounts to altering a symbolic link
239 :     to point at the new production system rather than the old. This has
240 :     the virtue of ease of recovery -- that is, if something goes wrong you
241 :     can flip back to the old system.
242 :     The actual steps are as follows:
243 : olson 1.1 <ol>
244 : gdpusch 1.9
<li> First, make sure that you are in the BASH shell by typing "echo $SHELL".
The result is a path such as "/bin/bash"; if it does not end in "bash", type "bash" to enter the BASH shell.
247 : gdpusch 1.10 <p>
248 : gdpusch 1.9
249 :     <li> Next, check that the result of typing "which perl" is the version
250 :     of perl owned by the SEED; it should look something like
251 :     <pre>
252 :     /Users/fig/FIGdisk/env/mac/bin/perl
253 :     </pre>
although the exact result will depend on where your existing copy
of the SEED is installed, whether your platform is a Macintosh or Linux,
etc. If the result does not look similar to the above, type:
257 :     <pre>
258 :     source Path_to_FIGdisk/config/fig-user-env.sh
259 :     </pre>
to set up your FIG environment properly.
261 : gdpusch 1.10 <p>
262 : gdpusch 1.9
263 :     <li> Next, make a copy of the Code Distribution Environment (from a DVD
264 : overbeek 1.7 or via the network). Suppose that we have made such a directory in
265 :     CodeDistEnv. Then use,
266 :     <pre>
267 :     cd CodeDistEnv
268 :     ./install-code TargetDirectory
269 :     </pre>
270 :     where <b>TargetDirectory</b> is where you wish to build the new
271 :     production version. We recommend calling it something like
272 :     <b>FIGdisk.July24</b>.
273 : gdpusch 1.10 <p>
274 : overbeek 1.7
275 : gdpusch 1.6 <li> Stop all work on the production machine for the duration of the update.
276 :     You do this by clicking on the "Seed Control Panel" link,
277 :     and then entering an explanatory message in the text box
278 :     and clicking on the "Disable SEED server" button.
279 : gdpusch 1.10 <p>
280 : gdpusch 1.9
281 : gdpusch 1.6 <li> You now need to capture the assignments, annotations and
282 :     subsystems work that has been done on the production machine.
283 :     To do this, you need to know when the last production release
284 :     was installed. Suppose that it was July 1, 2004.
285 : gdpusch 1.9 If that was the date, we recommend that you run
286 : gdpusch 1.6 <pre>
287 :     <b>extract_data_for_syncing_after_update 7/1/2004 /tmp/sync.data.july.1.2004</b>
288 :     </pre>
289 : gdpusch 1.9
290 : gdpusch 1.6 This will capture your updates and save them in the directory
291 : gdpusch 1.9 /tmp/sync.data.july.1.2004.<br>
292 : gdpusch 1.10 <p>
293 : gdpusch 1.9
294 : overbeek 1.7 <li>Now, you need to stop the existing production system using
295 :     <pre>
296 :     ~/FIGdisk/bin/stop-servers
297 :     </pre>
298 : gdpusch 1.10 <p>
299 : overbeek 1.7
300 :     <li>Now, you need to configure the runtime environment for the system
301 :     you are running on.
302 :     To do this, use
303 :     <pre>
304 :     cd TargetDirectory
305 :     ./configure MacOrLinux
306 :     </pre>
where <b>MacOrLinux</b> must be a currently supported environment.
As of July 24, 2004, the supported environments are <b>mac</b> for
Macintoshes running Panther, <b>mac-jaguar</b> for those that have not
upgraded to Panther, and <b>linux-postgres</b>.
311 : gdpusch 1.10 <p>
312 : overbeek 1.7
313 :     <li>Now, you need to insert the new Data directory into the newly
314 :     constructed version of the SEED. To do this use
315 :     <pre>
316 : gdpusch 1.9 chmod -R 777 TheNewData
317 : overbeek 1.7 cd TargetDirectory/FIG
318 :     ln -s TheNewData Data
319 :     </pre>
320 :     where TheNewData is the new Data directory, which normally comes from the
321 :     update system. If you acquired a new Data directory via Data DVDs, you
322 :     will need to unpack them using the README instructions, but what
323 :     results is a new version of the <b>Data</b> directory.
324 : gdpusch 1.10 <p>
325 : gdpusch 1.9
326 : overbeek 1.7 <li>Now, you need to start the servers in order to load the databases
327 :     with the new release using
328 :     <pre>
329 :     cd TargetDirectory/bin
330 :     ./start-servers
331 :     cd ..
332 :     source config/fig-user-env.sh
333 :     init_FIG
334 :     fig load_all
335 :     </pre>
336 :     This last command will run for several hours.
337 : gdpusch 1.10 <p>
338 : gdpusch 1.9
(<b>WARNING:</b> Please note that, because the new SEED's databases
do not yet exist, the <tt>init_FIG</tt> command will generate two totally
harmless but rather terrifying error messages the very first time it is executed,
so that its output will look something like this:
343 :    
344 :     <pre>
345 :     DBI connect('dbname=fig;port=10000','fig',...) failed: FATAL: Database "fig" does not exist in the system catalog. at /home2/FIGdisk.July22/dist/releases/snap-2004-0723/linux-postgres/lib/FigKernelPackages/DBrtns.pm line 21
346 :    
347 :     Initializing new SEED database fig
348 :    
349 :     ERROR: DROP DATABASE: database "fig" does not exist
350 :     dropdb: database removal failed
351 :     CREATE DATABASE
352 :     NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index 'file_table_pkey' for table 'file_table'
353 :     CREATE TABLE
354 :    
355 :     Complete. You will need to run "fig load_all" to load the data.
356 :     </pre>
We recognize that generating the above two error messages (one of them marked "FATAL")
constitutes a rather inelegant implementation,
but we have not yet found a database initialization method
that avoids generating them.)
361 :     <p>
362 :    
363 : gdpusch 1.6 <li> Now, you need to capture the changes made to the old production
364 :     version using something like
365 :     <pre>
366 :     <b>sync_new_system /tmp/sync.data.july.1.2004 make-assignments</b>
367 :     </pre>
368 : gdpusch 1.10 <p>
369 :    
370 : overbeek 1.7 <li>Run
371 : gdpusch 1.10 <pre>
372 : overbeek 1.7 index_annotations
373 :     index_subsystems
374 :     make_indexes
375 : gdpusch 1.10 </pre>
376 :     <p>
377 : gdpusch 1.9
378 :     <li> Now, finally, you should alter the symbolic link in <i>~fig</i> to
379 : overbeek 1.7 the current FIGdisk using something like:
380 :     <pre>
381 :     cd ~fig
382 :     rm FIGdisk # should be removing a symbolic link to the current SEED
383 :     ln -s TargetDirectory FIGdisk
384 :     </pre>
385 :     That should make the new SEED the one available through the Web interface.
386 : gdpusch 1.10 <p>
387 : gdpusch 1.9
388 : gdpusch 1.6 <li> You should now bring your update system to the same state as the
389 :     production system. This can be done by making sure that
390 :     <b>/tmp/sync.data.july.1.2004</b> is accessible to the update system.
391 :     If the production and update systems are run on the same machine, then
392 :     the directory is already there. If not, copy it to <b>/tmp</b> on the
393 :     update machine. Then run
394 :     <br>
395 :     <pre>
396 :     <b>sync_new_system /tmp/sync.data.july.1.2004 make-assignments</b>
397 :     </pre>
398 :     <br>
399 :     on the update machine.
400 : olson 1.1 </ol>
401 : overbeek 1.7 <p>
402 :    
403 : olson 1.1 Our experience is that anytime a group wishes to share a common production environment,
404 : overbeek 1.4 this 2-system approach is the way to do it. You can, if necessary,
405 :     put both systems on the same physical machine. This does require some
406 :     special handling in setting up two different <b>FIGdisk</b>
407 :     directories. We recommend using <b>FIGdisk.production</b> and
408 :     <b>FIGdisk.update</b>. However, in general it makes sense to use two
409 :     separate physical machines, for backup if nothing else. The update
410 :     system can usually be run on a $2000 (or less) box, although it is
411 :     desirable to spend a little more and get at least 1 gigabyte of main
412 :     memory and 200 gigabytes of external disk.
413 : olson 1.1 <br>
414 : gdpusch 1.8 <h2 id="adding_genomes">Adding a New Genome to an Existing SEED</h2>
Adding a new genome to a running SEED is fairly easy, but there are a
number of details that have to be handled with care.
417 :     <p>
418 :     The first thing to note is that the SEED does not include tools to call genes -- you are expected
419 :     to provide gene calls. This may change at some point, but for now you must call your own genes. A
420 :     number of good tools now exist in the public domain, and you will need to find one that seems adequate
421 :     for your needs.
422 :     <p>
Let us now
cover how to prepare the actual data. You need to construct a directory (somewhere like ~fig/Tmp)
of the following form:
426 :     <br>
427 :     <table width="100%">
428 :     <tr>
429 :     <td><tt>GenomeId</tt></td>
430 :     <td></td>
431 :     <td></td>
432 :     <td></td>
433 :     <td>of the form xxxx.y where xxxx is the taxon ID and y is an integer</td>
434 :     </tr>
435 :    
436 :     <tr>
437 :     <td></td>
438 :     <td><tt>PROJECT</tt></td>
439 :     <td></td>
440 :     <td></td>
<td>a file containing a description of the source of the data</td>
442 :     </tr>
443 :    
444 :     <tr>
445 :     <td></td>
446 :     <td><tt>GENOME</tt></td>
447 :     <td></td>
448 :     <td></td>
449 :     <td>a file containing a single line identifying the genus, species and strain</td>
450 :     </tr>
451 :    
452 :     <tr>
453 :     <td></td>
454 :     <td><tt>TAXONOMY</tt></td>
455 :     <td></td>
456 :     <td></td>
457 :     <td>a file containing a single line containing the NCBI taxonomy</td>
458 :     </tr>
459 :    
460 :     <tr>
461 :     <td></td>
462 :     <td><tt>RESTRICTIONS</tt></td>
463 :     <td></td>
464 :     <td></td>
465 :     <td>a file containing a description of distribution restrictions (optional)</td>
466 :     </tr>
467 :    
468 :     <tr>
469 :     <td></td>
470 :     <td><tt>CONTIGS</tt></td>
471 :     <td></td>
472 :     <td></td>
473 :     <td>contigs in fasta format</td>
474 :     </tr>
475 :    
476 :     <tr>
477 :     <td></td>
478 :     <td><tt>assigned_functions</tt></td>
479 :     <td></td>
480 :     <td></td>
481 :     <td>function assignments for the protein-encoding genes (optional)</td>
482 :     </tr>
483 :    
484 :     <tr>
485 :     <td></td>
486 :     <td><tt>Features</tt></td>
487 :     </tr>
488 :    
489 :     <tr>
490 :     <td></td>
491 :     <td></td>
492 :     <td><tt>peg</tt></td>
493 :     </tr>
494 :    
495 :     <tr>
496 :     <td></td>
497 :     <td></td>
498 :     <td></td>
499 :     <td><tt>tbl</tt></td>
<td>describes locations and aliases for the protein-encoding genes</td>
502 :     </tr>
503 :    
504 :     <tr>
505 :     <td></td>
506 :     <td></td>
507 :     <td></td>
508 :     <td><tt>fasta</tt></td>
<td>fasta file of translations of the protein-encoding genes</td>
511 :     </tr>
512 :    
513 :     <tr>
514 :     <td></td>
515 :     <td></td>
516 :     <td><tt>rna</tt></td>
517 :     </tr>
518 :    
519 :     <tr>
520 :     <td></td>
521 :     <td></td>
522 :     <td></td>
523 :     <td><tt>tbl</tt></td>
<td>describes locations and aliases for the rna-encoding genes</td>
526 :     </tr>
527 :    
528 :     <tr>
529 :     <td></td>
530 :     <td></td>
531 :     <td></td>
532 :     <td><tt>fasta</tt></td>
<td>fasta file of the DNA corresponding to the genes</td>
535 :     </tr>
536 :    
537 :    
538 :     </table>
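<p>Before running <i>add_genome</i>, it can help to check that a prepared directory has the shape
described in the table above. The following Python sketch does only that. It is an illustration, not a SEED tool:
the function name <tt>check_genome_dir</tt> is hypothetical, the file names are taken from the table exactly
as written (compare them, including upper and lower case, against an existing directory in
FIGdisk/FIG/Data/Organisms), and the assumption that the taxon ID is purely numeric is ours.</p>

```python
import re
from pathlib import Path

# File names as listed in the table; RESTRICTIONS and assigned_functions
# are optional, so they are not checked here.
REQUIRED_FILES = ("PROJECT", "GENOME", "TAXONOMY", "CONTIGS")
GENOME_ID = re.compile(r"^\d+\.\d+$")  # xxxx.y, assuming a numeric taxon ID


def check_genome_dir(path, required=REQUIRED_FILES):
    """Return a list of problems found in a prepared genome directory."""
    d = Path(path)
    problems = []
    if not GENOME_ID.match(d.name):
        problems.append(f"directory name {d.name!r} is not of the form xxxx.y")
    for name in required:
        if not (d / name).is_file():
            problems.append(f"missing file: {name}")
    # The peg and rna subdirectories are optional, but when one is present
    # it should hold both a tbl file and a fasta file.
    for kind in ("peg", "rna"):
        sub = d / "Features" / kind
        if sub.is_dir():
            for name in ("tbl", "fasta"):
                if not (sub / name).is_file():
                    problems.append(f"missing file: Features/{kind}/{name}")
    return problems
```

<p>An empty list means that the layout looks plausible; it says nothing about whether the contents
of the files are correct.</p>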
539 :    
540 :     <!--
541 :    
542 :     <pre>
543 :     GenomeID of the form xxxx.y where xxxx is the taxon ID and y is an integer
544 :    
545 :     PROJECT a file containg a description of the source of the data
546 :    
547 :     GENOME a file containing a single line identifying the genus, species and strain
548 :    
549 :     TAXONOMY a file containing a single line containing the NCBI taxonomy
550 :    
551 :     RESTRICTIONS a file containing a description of distribution restrictions (optional)
552 :    
553 :     contigs contigs in fasta format
554 :    
555 :     assigned_functions function assignments for the protein-encoding genes (optional)
556 :    
557 :     Features
558 :    
559 :     peg
560 :     tbl descibes locations and aliases for the protein-encoding genes
561 :    
562 :     fasta fasta file of translations of the protein-encoding genes
563 :    
564 :     rna
565 :     tbl describes locations and aliases for the rna-encoding genes
566 :    
567 :     fasta fasta file of the DNA corresponding to the genes
568 :     </pre>
569 :     -->
570 :     <br>
571 :     <br>
572 :     Let us expand on this very brief description:
573 :     <ol>
574 :     <li>
575 :     The name of the directory must be of the form xxxx.y where xxxx is the
576 :     taxon ID, and y is a sequence number. For example, 562.1 might be
577 :     used for <i>E.coli</i>, since 562 is the NCBI taxon ID for
578 :     <i>Escherichia coli</i>. The sequence number (y) is used to
579 :     distinguish multiple genomes having the same taxon ID.
580 :     <br><br>
581 :     <li>
The assigned_functions file contains assignments of function for the
protein-encoding genes. Each line is of the form
584 :     <pre>
585 :     Id\tFunction\tConfidence (\t stands for a tab character)
586 :     </pre>
587 :     The Id must be a valid PEG Id. These are of the form:
588 :     <pre>
589 :     fig|xxxx.y.peg.z
590 :     </pre>
591 :     where xxxx.y is the genome Id, and z is an integer that uniquely distinguishes
592 :     the peg (protein-encoding gene).
593 :     <br>
594 :     <i>Confidence</i> is a single character code:
595 :     <br>
596 :     <ul>
597 :     <li>a space for "normal"
598 :     <li>w for "weak"
599 :     <li>e for experimentally verified
600 :     <li>s for "strong evidence (but not experimental)"
601 :     </ul>
602 :     The second tab and the confidence code can be omitted (it will default to a space).
603 :     The assigned_functions file is optional. You can leave it blank and, after adding the genome
604 :     to the SEED, ask for automated assignments.
605 :     <br><br>
606 :     <li>
The tbl files specify the locations of genes, as well as any aliases. Each line in a tbl file
is of the form
609 :     <br>
610 :     <pre>
611 :     Id\tLocation\tAliases (the aliases are separated by tabs)
612 :     </pre>
613 :     The Id must conform to the fig|xxxx.y.peg.z format described above. The <i>Location</i> is of the form
614 :     <br>
615 :     <pre>
616 :     L1,L2,L3...Ln
617 :    
618 :     where each Li describes a region on a contig and is of the form
619 :    
620 :     <i>Contig_Begin_End</i> where
621 :    
622 :     Contig is the Id of the contig,
623 :     Begin is the position of the first character, and
624 :     End is the position of the last character
625 :     </pre>
626 :     <ul>
<li>if Begin &gt; End, the region being described is on the complementary strand, and
<li>the End position is the last character preceding the stop codon (i.e., the region
corresponding to a protein-encoding gene is thought of as including all bases from the
first base of the start codon to the last base before the stop codon).
631 :     </ul>
632 :     For example,
633 :     <pre>
634 :     fig|562.1.peg.15 Escherichia_coli_K12_14168_15295 dnaJ b0015 sp|P08622 gi|16128009
635 :     </pre>
636 :     describes the <i>dnaJ</i> gene encoded on the positive strand from 14168 through 15295 on the contig Escherichia_coli_K12.
637 :     The gene is from the genome 562.1, and it has 4 specified aliases.
638 :     <li>
639 :     The fasta files must have gene Ids that match tbl file entries. The <i>peg</i> fasta file contains translations,
640 :     while the <i>rna</i> fasta file contains DNA sequences.
641 :     <li>
642 :     Both the <i>peg</i> and the <i>rna</i> subdirectories are optional.
643 :     </ol>
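<p>The tbl line format and the <i>Location</i> syntax described above can be illustrated with a short
Python sketch. It is not SEED code; the names <tt>Region</tt>, <tt>parse_location</tt>, and
<tt>parse_tbl_line</tt> are hypothetical. Because contig Ids may themselves contain underscores
(as in Escherichia_coli_K12), each region is split from the right.</p>

```python
from typing import NamedTuple


class Region(NamedTuple):
    contig: str
    begin: int
    end: int

    @property
    def strand(self):
        # Begin > End means the region lies on the complementary strand.
        return "-" if self.begin > self.end else "+"


def parse_location(location):
    """Split a Location such as 'Contig_Begin_End,Contig_Begin_End' into regions."""
    regions = []
    for part in location.split(","):
        # Split from the right: the last two fields are Begin and End,
        # everything before them is the contig Id.
        contig, begin, end = part.rsplit("_", 2)
        regions.append(Region(contig, int(begin), int(end)))
    return regions


def parse_tbl_line(line):
    """Return (Id, list of regions, list of aliases) for one tbl line."""
    fields = line.rstrip("\n").split("\t")
    feature_id, location, aliases = fields[0], fields[1], fields[2:]
    return feature_id, parse_location(location), aliases
```

<p>Applied to the <i>dnaJ</i> example above (with tabs between the fields), this yields the Id
fig|562.1.peg.15, a single region on contig Escherichia_coli_K12 from 14168 to 15295 on the
positive strand, and four aliases.</p>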
644 :     <br>
645 :     The SEED provides a utility that can be used to produce such a directory from a GenBank entry. Thus,
646 :     <br>
647 :     <pre>
parse_genbank 562.4 ~/Tmp/562.4 &lt; genbank.entry.for.a.new.E.coli.genome
649 :     </pre>
650 :     would attempt to produce a properly formatted directory (~/Tmp/562.4) containing
651 :     the data encoded in the GenBank entry from the file <i>genbank.entry.for.a.new.E.coli.genome</i>.
This script is far from perfect, and there is huge variance in how GenBank
files encode their data, so use it at your own risk and check the output manually.
654 :     <p>
655 :     You would be well advised to look at some of the subdirectories included in the FIGdisk/FIG/Data/Organisms directory
656 :     to see examples of how it should be done.
657 :     <p>
658 :     So, supposing that you have built a valid directory (say, <i>/Users/fig/Tmp/562.4</i>), you can add the genome using
659 :     <pre>
660 :     fig add_genome /Users/fig/Tmp/562.4
661 :     </pre>
662 :     <br>
663 :     The <i>add_genome</i> request will add your new genome and queue a computational request that similarities
664 :     be computed for the protein-encoding genes.
665 :    
666 : overbeek 1.17 <h2 id="importing_external">Importing External Protein Data</h2>
667 :    
The presence of external judgements about the possible functions of encoded proteins
is one of the essential aspects of the SEED. It is therefore important to be able to
add new sources of annotation, and to periodically update the judgements of
existing sources. To update the external sets of proteins and annotations, build a new nonredundant
database of proteins, and compute the associated similarities, proceed as follows:
673 :    
674 :     <ol>
675 :     <li> Stop using the system until this procedure completes.
676 :     <br><br>
677 :     <li> Update the NR Directory
678 :     <br><br>
679 :     The <b>NR</b> directory is located within the <b>Data</b> directory:
680 :     <br>
<pre>
~fig                     on a Mac: /Users/fig; on Linux: /home/fig
    FIGdisk
        dist             source code
        FIG
            Tmp          temporary files
            Data         data in readable form
                NR       contains external data
</pre>
691 :    
692 :     The <b>NR</b> directory contains one subdirectory for each source of external
693 :     assignments (the released SEED includes subdirectories for SwissProt, NCBI, UniProt, and KEGG).
694 :     You may add more subdirectories.
695 :     <p>
696 :     Each subdirectory must include 3 files:
697 :     <ol>
698 :     <li> <b>fasta</b> should be a fasta file containing the protein sequences. These sequences will
699 :     be used to establish a correspondence between these IDs and other protein sequences within the SEED.
700 :     <br><br>
701 :     <li> <b>org.table</b> is a two-column, tab-separated table. Column 1 is the ID, and column 2 is the
702 :     organism corresponding to the ID.
703 :     <br><br>
704 :     <li> <b>assign_functions</b> is a 2-column table. The ID is in column 1, and column 2 contains the
705 :     gene function (often called a <i>product name</i>) asserted by the external source.
706 :     </ol>
707 :     <br>
708 :     You should proceed only when you have updated as many of the sources as you wish.
709 :     <br><br>
710 :     <li> Now run
711 :     <pre>
712 :     import_external_sequences_step1
713 :     </pre>
714 :    
715 :     This program will build a new nonredundant database, check to see what has changed, and will
716 :     build the input required to compute new similarities.
717 :     <br><br>
<li> Compute the needed similarities
<br><br>
You will need three files to compute a new batch of similarities. The locations of these
three files are displayed by <b>import_external_sequences_step1</b> just before it completes
(i.e., they appear at the end of the output of the previous step). Compute the similarities (see
the discussion below) and store them in the <b>NewSims</b> directory (again, the precise location
is displayed by <b>import_external_sequences_step1</b>).
725 :     <br><br>
726 :     <li> Run
727 :     <pre>
728 :     import_external_sequences_step3
729 :     </pre>
730 :     </ol>
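<p>Since each <b>NR</b> subdirectory must contain the three files listed above, a quick check before
running <b>import_external_sequences_step1</b> can catch an incomplete source. The Python sketch below is
an illustration only: the function name <tt>check_nr_source</tt> is hypothetical, the file names are
copied from the list above, and the assumption that both tables are tab-separated is ours (the text
states this only for <b>org.table</b>).</p>

```python
from pathlib import Path

# File names as given in the list above.
NR_FILES = ("fasta", "org.table", "assign_functions")
TWO_COLUMN_TABLES = ("org.table", "assign_functions")


def check_nr_source(path):
    """Return a list of problems found in one NR source subdirectory."""
    d = Path(path)
    problems = [f"missing file: {name}" for name in NR_FILES if not (d / name).is_file()]
    for name in TWO_COLUMN_TABLES:
        table = d / name
        if not table.is_file():
            continue
        with table.open() as handle:
            for lineno, line in enumerate(handle, 1):
                # Assumed layout: exactly two tab-separated columns per line.
                if len(line.rstrip("\n").split("\t")) != 2:
                    problems.append(f"{name}:{lineno}: expected 2 tab-separated columns")
    return problems
```

<p>The check is purely structural; it does not compare the IDs in the tables with those in the fasta file.</p>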
731 :    
732 : gdpusch 1.8 <h2 id="sims">Computing Similarities</h2>
733 : olson 1.1
Adding a genome does not by itself compute similarities for the new genome.
You need to compute them yourself and make them available in
the <b>FIGdisk/FIG/Data/NewSims</b> directory.
737 : olson 1.1 <p>
738 : overbeek 1.15 To compute similarities, you will need to do the following:
739 : olson 1.1 <ol>
740 : overbeek 1.15 <li>The translations of the set of PEGs in your new genome (i.e., genome 562.4) should be in
741 :     <b>~fig/FIGdisk/FIG/Data/Organisms/562.4/Features/peg/fasta</b>. A copy of this was appended to
742 :     <b>~fig/FIGdisk/FIG/Data/Global/nr</b> when your genome was added. <b>nr</b> is the "nonredundant database"
743 :     we use to compute similarities (and the one you must use). To get the initial blast results, you would use something
744 :     like
745 :     <br>
746 :     <pre>
747 :     blastall -i ~fig/FIGdisk/FIG/Data/Organisms/562.4/Features/peg/fasta -d ~fig/FIGdisk/FIG/Data/Global/nr -m 8 -FF -p blastp | reduce_sims ~fig/FIGdisk/FIG/Data/Global/peg.synonyms 300 > reduced.sims
748 :     </pre>
749 : olson 1.1 <br>
750 : overbeek 1.15 which produces the blast results in a tab-separated format. The invocation of <b>reduce_sims</b> is optional.
751 :     It limits the retained similarities for each PEG to 300, using a selective truncation that attempts to preserve at least one similarity against each other genome.
752 : olson 1.1 <li>
753 : overbeek 1.15 The output of blastall lacks two columns that we need: the lengths of the two similar sequences. To add them, you would use
754 : olson 1.1 <br>
755 : overbeek 1.15 <pre>
756 :     reformat_sims ~fig/FIGdisk/FIG/Data/Global/nr < reduced.sims > ~fig/FIGdisk/FIG/Data/NewSims/sims.for.562.4
757 :     </pre>
758 :     <br>
759 :     This appends two columns to each similarity and places the results in the <b>NewSims</b>
760 :     directory, where they belong.
761 :     </ol>
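<p>
As a quick sanity check on the two steps above: the <b>-m 8</b> output of blastall has 12 tab-separated columns, so once <b>reformat_sims</b> has appended the two length columns, each line should have 14. The following is a minimal sketch under that assumption; the function name <b>check_sims_columns</b> is hypothetical and is not part of the SEED.

```shell
# Hypothetical helper (not a SEED tool): verify that every line of a
# reformatted sims file has 14 tab-separated columns, i.e. the 12 columns
# of "blastall -m 8" plus the two length columns added by reformat_sims.
check_sims_columns() {
    # $1 = sims file to check; prints offending lines, returns nonzero if any
    awk -F'\t' '
        NF != 14 { printf "line %d: %d columns\n", NR, NF; bad = 1 }
        END      { exit (bad ? 1 : 0) }
    ' "$1"
}
```

It could be run as <b>check_sims_columns ~fig/FIGdisk/FIG/Data/NewSims/sims.for.562.4</b> before indexing.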
762 : overbeek 1.12 <p>
763 : overbeek 1.16 The above description will produce similarities using a single invocation of
764 :     blastall. For most large genomes, and whenever you wish to process a batch of genomes,
765 :     you should use parallel processing while maintaining the spirit of the approach.
766 : overbeek 1.12 No matter how you produce the new similarities, they need to be added
767 :     as a file in the <b>FIGdisk/FIG/Data/NewSims</b> directory. Then, you
768 :     need to index these similarities using
769 :     <pre>
770 :     index_sims ~/FIGdisk/FIG/Data/NewSims/XXXX
771 :     </pre>
772 :     where XXXX is the file you added. If you have more than one such
773 :     file, just give them all as arguments to the command. This
774 :     "indexes" the similarities, so that any of the new PEGs which have
775 :     similarities connecting them to PEGs from the existing genomes
776 :     can now be displayed. However, the connection from the existing
777 :     genomes to the new PEGs does not yet exist (we call these the "flips"
778 :     of the computed sims). To get this ability, you need to go through a
779 :     process that will make your system unavailable for a period (and
780 :     will produce a substantial load on your system for a day or so, while
781 :     the SEED sorts, sifts, inserts, and generally plays with the "flips").
782 :     <br>
783 :     The extra steps you need to take to make a fully functional version
784 :     are as follows:
785 :     <ol>
786 :     <li>
787 :     First, you need to run
788 :     <pre>
789 :     update_sims ~/FIGdisk/FIG/Data/Global/peg.synonyms 300 ~/FIGdisk/FIG/Data/Sims ~/Tmp/FlippedSims ~/FIGdisk/FIG/Data/NewSims/*
790 :     </pre>
791 :     This should produce updated similarity files in a VERY BIG directory
792 : overbeek 1.14 that we happened to put at <i>~/Tmp/FlippedSims</i> (but which you could
793 : overbeek 1.12 put anywhere). This may run for as much as a day or so (and you can watch
794 :     its progress as it updates the similarity files).
795 :     <li>The next step is to replace the existing similarity files with the
796 :     newly computed ones. First, you need to make the SEED unavailable (via the
797 :     <b>SEED Control Panel</b>).
798 :     <li>Then, blow away the existing similarities using something like
799 :     <pre>
800 :     rm ~/FIGdisk/FIG/Data/Sims/*
801 :     rm ~/FIGdisk/FIG/Data/NewSims/*
802 :     cp ~/Tmp/FlippedSims/* ~/FIGdisk/FIG/Data/Sims
803 :     rm -r ~/Tmp/FlippedSims
804 :     </pre>
805 :     There are several ways to do this. You might want to save the old
806 :     similarities somewhere. You might be able to move (rather than copy)
807 :     the similarities. Whatever suits you.
808 :     <li> Then run
809 :     <pre>
810 :     index_sims
811 :     </pre>
812 :     to re-index all of the similarities, and you should be fully
813 :     operational.
814 :     </ol>
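<p>
If you would rather save the old similarities than delete them, the replacement step can be done with moves instead of <b>rm</b> and <b>cp</b>. The following is only a sketch: the function name <b>replace_sims</b> and the backup directory are assumptions, not part of the SEED. It covers only the contents of the <b>Sims</b> directory; making the SEED unavailable, cleaning out <b>NewSims</b>, and the final <b>index_sims</b> run remain as described above.

```shell
# Hypothetical helper (not a SEED tool): replace the contents of the Sims
# directory with the newly computed files, keeping the old ones in a backup
# directory instead of deleting them.
replace_sims() {
    # $1 = live Sims directory, $2 = directory of new sims, $3 = backup directory
    [ -d "$2" ] && [ -n "$(ls -A "$2")" ] || {
        echo "no new similarity files in $2" >&2
        return 1
    }
    mkdir -p "$3" || return 1
    # Set the old files aside first, so a failure leaves them recoverable.
    if [ -n "$(ls -A "$1")" ]; then
        mv "$1"/* "$3"/ || return 1
    fi
    mv "$2"/* "$1"/
}
```

For example, <b>replace_sims ~/FIGdisk/FIG/Data/Sims ~/Tmp/FlippedSims ~/Tmp/OldSims</b>, where <b>~/Tmp/OldSims</b> is an arbitrary location chosen for illustration.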
815 :     <br>
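<p>
Returning to the parallel processing suggested earlier in this section: one simple approach is to split the query FASTA file into pieces and give each piece to its own blastall run. The following is a sketch only; the function name <b>split_fasta</b>, the chunk size, and the output naming are assumptions, and the SEED may provide its own tools for this.

```shell
# Hypothetical helper (not a SEED tool): split a FASTA file into pieces of
# at most N sequences each, written as PREFIX.0001, PREFIX.0002, ...
split_fasta() {
    # $1 = input FASTA, $2 = sequences per piece, $3 = output prefix
    [ "$2" -gt 0 ] 2>/dev/null || {
        echo "sequences per piece must be a positive integer" >&2
        return 1
    }
    awk -v n="$2" -v prefix="$3" '
        /^>/ {
            # Start a new piece every n header lines.
            if (count % n == 0) {
                if (out != "") close(out)
                out = sprintf("%s.%04d", prefix, ++chunk)
            }
            count++
        }
        out != "" { print > out }
    ' "$1"
}
```

Each piece can then be run through blastall with the same options as above, on a separate processor or machine. Because all hits for a given PEG come from a single piece, the per-piece outputs should be safe to concatenate before <b>reformat_sims</b> is run.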
816 : olson 1.1
817 : gdpusch 1.8 <h2 id="deleting_genomes">Deleting Genomes from a Version of the SEED</h2>
818 : olson 1.1
819 :     There are two common instances in which one wishes to delete genomes from a running version of the SEED: one is
820 :     when you wish to replace an existing version of a genome (in which case the replacement is viewed as first
821 :     deleting the existing copy and then adding the new copy), and the second is when you wish to create a copy
822 :     of the SEED containing a subset of the entire collection of genomes.
823 :     <p>
824 :     To delete a set of genomes from a running version of the SEED, just use
825 :     <pre>
826 : overbeek 1.18 fig mark_deleted_genomes User G1 G2 ...Gn (where G1 G2 ... Gn designates a list of genomes)
827 : olson 1.1 </pre>
828 :     For example,
829 :     <pre>
830 : overbeek 1.18 fig mark_deleted_genomes RossO 562.1
831 : olson 1.1 </pre>
832 :     could be used to delete a single genome with a genome ID of 562.1.
833 :    
834 : gdpusch 1.8 <h2 id="reintegrate_sims">Periodic Reintegration of Similarities</h2>
835 : olson 1.1
836 :     When the initial SEED was constructed, similarities were computed. For most similarities of the form
837 :     "Id1 and Id2 are similar", entries were "recorded" for both Id1 and Id2. This is not always true,
838 :     since we truncate the number of similarities associated with any single Id (leaving us in a situation
839 :     in which we may have a similarity recorded for Id1, but not Id2). When a genome is added, if Id1 was an added
840 :     protein-encoding gene (peg), then the similarity is "recorded" for Id1 but not Id2. This means that when looking
841 :     at genes from previously existing organisms, you never get links back to the added pegs. This is not totally
842 :     satisfactory.
843 :     <p>
844 :     Periodically, it is probably a good idea to "reintegrate the similarities". This can be done by
845 :     just running
846 :     <pre>
847 :     reintegrate_sims
848 :     # update_sims /dev/null /dev/null ~/FIGdisk/FIG/Data/NewSims/* ; rm -f ~/FIGdisk/FIG/Data/NewSims/* ; index_sims
849 :     </pre>
850 :     The job will probably run for quite a while (perhaps as much as a day or two).
851 :    
852 : gdpusch 1.8 <h2 id="pins_and_clusters">Computing "Pins" and "Clusters"</h2>
853 : olson 1.1
854 :     The SEED displays potentially significant clusters on prokaryotic chromosomes. In the
855 :     process of finding preserved contiguity, it computes "pins", which are simply a set of genes
856 :     that are believed to be orthologs that cluster with similar genes. If you add your own genome,
857 :     you will probably want to compute and enter these into the active database. This can be done
858 :     using
859 :     <pre>
860 :     compute_pins_and_clusters G1 G2 G3 ...
861 :     </pre>
862 :     where the arguments are genome Ids. Thus,
863 :     <pre>
864 :     compute_pins_and_clusters 562.4
865 :     </pre>
866 :     would compute and add entries for all of the <i>pegs</i> in genome 562.4.
867 : gdpusch 1.13
868 :     <h2 id="auto_annotation">
869 :     Automatic Annotation of Genomes
870 :     </h2>
871 :     The SEED provides a simple but limited capability for automated assignment
872 :     of protein-encoding gene function based on similarity.
873 :     Candidate functions are assigned scores based on the combined strengths
874 :     of all BLASTP similarities to genes carrying that particular assignment,
875 :     weighted by the provenance and assignment-confidence for each similar gene.
876 :     The final automated function assignment is then determined from the
877 :     list of candidate functions and their associated scores.
878 :    
879 :     Automated assignment is a four-step process:
880 :     <ol>
881 :     <li> Create a list of PEGs to be automatically assigned.
882 :     If one wishes to make assignments to an entire organism or set of organisms
883 :     that are already installed in the SEED, the simplest method for creating
884 :     this list is to type the following command:
885 :     <pre>
886 :     pegs Genome1 Genome2 Genome3 ... > ~/Tmp/peg.list
887 :     </pre>
888 :    
889 :     <p>
890 :     <li> Next, create a list of candidate function-assignments using the following
891 :     command:
892 :     <pre>
893 :     auto_assign < ~/Tmp/peg.list > ~/Tmp/candidate.funcs
894 :     </pre>
895 :     (NOTE: The <b>auto_assign</b> command has some additional optional parameters;
896 :     for example, if one knows that all the PEGs in 'peg.list' are from
897 :     prokaryotic organisms, one can make use of this additional information
898 :     by invoking <b>auto_assign</b> as follows:
899 :     <pre>
900 :     auto_assign prokaryote < ~/Tmp/peg.list > ~/Tmp/candidate.funcs
901 :     </pre>
902 :     Also, if one wishes to use an alternate file of similarity data named 'simfile'
903 :     instead of the precomputed similarities stored in the SEED, one can instead type:
904 :     <pre>
905 :     auto_assign sims=simfile < ~/Tmp/peg.list > ~/Tmp/candidate.funcs
906 :     </pre>
907 :     Finally, <b>auto_assign</b> can read a set of alternate parameters from a file,
908 :     but we recommend that you stick with the default settings, and not exploit this
909 :     last feature unless you are a qualified SEED wizard.)
910 :     <p>
911 :    
912 :     <li> Next, create a SEED format assigned-functions file as follows:
913 :     <pre>
914 :     make_calls < ~/Tmp/candidate.funcs > ~/Tmp/assigned_functions
915 :     </pre>
916 :     Alternatively, if you wish to suppress the class of "non-informative" function assignments
917 :     such as "Hypothetical protein," "Unclassified protein," "predicted gene," etc.,
918 :     you may do so using the '-no_hypos' flag:
919 :     <pre>
920 :     make_calls -no_hypos < ~/Tmp/candidate.funcs > ~/Tmp/assigned_functions
921 :     </pre>
922 :    
923 :     <li> Finally, install the automated assignments in the SEED using the command
924 :     <pre>
925 : overbeek 1.17 fig assign_functionF master:automated_assignments ~/Tmp/assigned_functions
926 : gdpusch 1.13 </pre>
927 :    
928 :     </ol>
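<p>
The four steps above can be chained into a single shell function. This is a sketch, not an official SEED script: the commands and their arguments are the ones given in the steps, while the function name <b>auto_annotate</b>, the work-directory argument, and the error handling are assumptions.

```shell
# Sketch only: chain the four automated-assignment steps. The command names
# and arguments come from the tutorial text; the function name, the work
# directory argument, and the error handling are assumptions.
auto_annotate() {
    # $1 = work directory; remaining arguments = genome IDs
    work=$1
    shift
    [ "$#" -gt 0 ] || {
        echo "usage: auto_annotate WORKDIR GENOME..." >&2
        return 1
    }
    mkdir -p "$work" || return 1

    # Steps 1-3: list the PEGs, score candidate functions, make the calls.
    pegs "$@" > "$work/peg.list" || return 1
    auto_assign < "$work/peg.list" > "$work/candidate.funcs" || return 1
    make_calls < "$work/candidate.funcs" > "$work/assigned_functions" || return 1

    # Step 4: install the assignments in the SEED.
    fig assign_functionF master:automated_assignments "$work/assigned_functions"
}
```

For example, <b>auto_annotate ~/Tmp 562.4</b> would run all four steps for genome 562.4 and stop at the first step that fails, leaving the intermediate files in <b>~/Tmp</b> for inspection.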
929 :    
930 :     It should be noted once again that the SEED's automated assignment algorithm
931 :     is quite simple and crude, being only slightly better than simply assigning
932 :     the function of the highest-scoring BLASTP hit; however, it at least provides
933 :     a "quick and dirty" starting point for making an initial assessment of a genome,
934 : overbeek 1.17 which may then be cleaned up and refined by skilled genome annotators.
935 : gdpusch 1.13
936 :    
937 :    
938 :    
939 :    
940 :    
