Diff of /FigTutorial/SEED_administration_issues.html

revision 1.5, Wed Jul 21 20:25:42 2004 UTC revision 1.14, Mon Nov 8 19:13:58 2004 UTC
# Line 1  Line 1 
1  <h1>SEED Administration</h1>  <h1>SEED Administration</h1>
2  <p>This tutorial discusses a number of issues that you will need to know about  
3    in order to install, share, and maintain your SEED installation.</p>  <p>
4  <h2>Backing Up Your Data</h2>  This tutorial discusses a number of issues that you will need to know about
5    in order to install, share, and maintain your SEED installation.
6    It is organized as follows:
7    </p>
8    
9    <ul>
10    <li><A HREF="#backups">
11         Backing Up Your Data
12    </A>
13    
14    <li><A HREF="#copying">
15         Copying a Version of the SEED
16    </A>
17    
18    <li><A HREF="#multiple_copies">
19         Running Multiple Copies of the SEED
20    </A>
21    
22    <li><A HREF="#adding_genomes">
23    Adding a New Genome to an Existing SEED
24    </A>
25    
26    <li><A HREF="#sims">
27        Computing Similarities
28    </A>
29    
30    <li><A HREF="#deleting_genomes">
31        Deleting Genomes from a Version of the SEED
32    </A>
33    
34    <li><A HREF="#reintegrate_sims">
35        Periodic Reintegration of Similarities
36    </A>
37    
38    <li><A HREF="#pins_and_clusters">
39        Computing "Pins" and "Clusters"
40    </A>
41    
42    <li><A HREF="#auto_annotation">
43        Automatic Annotation of Genomes
44    </A>
45    
46    </ul>
47    
48    
49    <h2 id="backups">Backing Up Your Data</h2>
50  The data and code stored within the SEED are organized as follows:  The data and code stored within the SEED are organized as follows:
51  <pre>  <pre>
52          ~fig                                 on a Mac: /Users/fig; on Linux: /home/fig          ~fig                                 on a Mac: /Users/fig; on Linux: /home/fig
# Line 49  Line 94 
94  /Volumes/Backup is a backup disk.  Then,  /Volumes/Backup is a backup disk.  Then,
95  <br>  <br>
96  <pre>  <pre>
97          cp -pRP /Users/fig/FIGdisk/FIG/Data /Volumes/Backup/Data.Backup          cp -pRP ~/FIGdisk/FIG/Data /Volumes/Backup/Data.Backup
98          gzip -r /Volumes/Backup/Data.Backup          gzip -r /Volumes/Backup/Data.Backup
99  </pre>  </pre>
100  <br>  <br>
101  would be a reasonable way to make a backup.  The copy preserves  would be a reasonable way to make a backup.  The copy preserves
102  permissions, copies recursively, and does not follow symbolic links.  permissions, copies recursively, and does not follow symbolic links.
103  <br>  <br>
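<p>As a rough illustration only, the backup commands above could be wrapped in a small shell function. The name <code>backup_seed_data</code>, the date-stamped destination directory, and the refusal to overwrite an existing backup are assumptions of this sketch; none of them are part of the SEED distribution.</p>

```shell
# Hypothetical helper (not part of the SEED): copy a Data directory to a
# date-stamped backup and compress the copy.
backup_seed_data() {
    src="$1"                          # e.g. ~/FIGdisk/FIG/Data
    dest_root="$2"                    # e.g. /Volumes/Backup
    stamp="${3:-$(date +%Y-%m-%d)}"   # optional label; defaults to today's date
    dest="$dest_root/Data.Backup.$stamp"

    if [ ! -d "$src" ]; then
        echo "no such directory: $src" >&2
        return 1
    fi
    if [ -e "$dest" ]; then
        echo "refusing to overwrite existing backup: $dest" >&2
        return 1
    fi

    # Same flags as in the tutorial: preserve permissions, recurse,
    # and do not follow symbolic links.
    cp -pRP "$src" "$dest" || return 1
    gzip -r "$dest"
}
```

<p>It might be invoked as <code>backup_seed_data ~/FIGdisk/FIG/Data /Volumes/Backup</code>; keeping one directory per date makes it easier to discard old backups later.</p>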
104  <h2>Copying a Version of the SEED</h2>  <h2 id="copying">Copying a Version of the SEED</h2>
105    
106  To make a second copy of the SEED (either for a friend or for yourself), you should use tar  To make a second copy of the SEED (either for a friend or for yourself), you should use tar
107  to preserve a few symbolic links (which are relative, not absolute; this means that they can  to preserve a few symbolic links (which are relative, not absolute; this means that they can
# Line 156  Line 201 
201  <blockquote>  <blockquote>
202    <p><a href="http://www-unix.mcs.anl.gov/SEEDWiki/moin.cgi/SeedInstallationInstructions">      http://www-unix.mcs.anl.gov/SEEDWiki/moin.cgi/SeedInstallationInstructions</a></p>    <p><a href="http://www-unix.mcs.anl.gov/SEEDWiki/moin.cgi/SeedInstallationInstructions">      http://www-unix.mcs.anl.gov/SEEDWiki/moin.cgi/SeedInstallationInstructions</a></p>
203  </blockquote>  </blockquote>
204  <h2>Running Multiple Copies of the SEED</h2>  <h2 id="multiple_copies">Running Multiple Copies of the SEED</h2>
205    
206  For individual users that use the SEED to support comparative analysis, a single copy is completely  For individual users that use the SEED to support comparative analysis, a single copy is completely
207  adequate.  Adding genomes can usually be done without disrupting normal use, and a very occasional major  adequate.  Adding genomes can usually be done without disrupting normal use, and a very occasional major
# Line 166  Line 211 
211  effort.  In this case, you have a user community that is sensitive to disruptions of service, and you  effort.  In this case, you have a user community that is sensitive to disruptions of service, and you
212  have frequent demands to update versions of data.  In this case, it is best to have two systems: the  have frequent demands to update versions of data.  In this case, it is best to have two systems: the
213  <b>production system</b> is used to support the larger user community, and the <b>update system</b> is  <b>production system</b> is used to support the larger user community, and the <b>update system</b> is
214  used to prepare updated versions of the system.  Even so, work stoppages of 4-8 hours will occur when  used to prepare updated versions of the system.
215  new releases are swapped in.  To swap in new data from the update system to the production system,  New genomes are added to the update system, and then periodically a
216  you need to  revised Data directory is extracted to update the production system.
217    Even so, work stoppages of a few hours will occur when
218    new releases are swapped in.
219    <p>
220    This use of an "update" and a "production" system is quite analogous
221    to running a production system which is occasionally updated from new
222    Data DVDs (which FIG normally makes available about every 4-6 months).
223    That is, in both cases you are updating a production system from a
224    newly created <b>Data</b> directory that is lacking assignments and
225    annotations that exist on your production system.  However, if you have
226    added new genomes to the production system (that are not part of the
227    releases you may acquire via DVDs), you should get the new release,
228    install the versions of your local genomes, and then do this update
229    procedure.
230    <p>
231    The plan we propose is to build a completely encapsulated new version
232    of the system, then capture updates from the old production system, update
233    the new production system, and then make the new version the actual
234    production system.  This last step amounts to altering a symbolic link
235    to point at the new production system rather than the old.  This has
236    the virtue of ease of recovery -- that is, if something goes wrong you
237    can flip back to the old system.
238    The actual steps are as follows:
239  <ol>  <ol>
240  <li>stop all work on the production machine by clicking on the "Seed Control Panel" link,  
241  entering an explanatory message in the text box, and clicking on the "Disable SEED server" button.  <li> First, make sure that you are in the BASH shell by typing "echo $SHELL";
242       if the result does not end in "bash" (for example, "/bin/bash"), type "bash" to enter the BASH shell.
243    <p>
244    
245    <li> Next, check that the result of typing "which perl" is the version
246       of perl owned by the SEED; it should look something like
247       <pre>
248           /Users/fig/FIGdisk/env/mac/bin/perl
249       </pre>
250       although the exact results will depend on where your existing copy
251       of the SEED is installed, whether your platform is a Macintosh or LINUX,
252       etc. If the result does not look similar to the above, type:
253       <pre>
254           source Path_to_FIGdisk/config/fig-user-env.sh
255       </pre>
256       to set up your FIG environment properly.
257    <p>
258    
259    <li> Next, make a copy of the Code Distribution Environment (from a DVD
260    or via the network).  Suppose that we have made such a directory in
261    CodeDistEnv.  Then use,
262    <pre>
263            cd CodeDistEnv
264            ./install-code TargetDirectory
265    </pre>
266    where <b>TargetDirectory</b> is where you wish to build the new
267    production version.  We recommend calling it something like
268    <b>FIGdisk.July24</b>.
269    <p>
270    
271    <li> Stop all work on the production machine for the duration of the update.
272         You do this by clicking on the "Seed Control Panel" link,
273         and then entering an explanatory message in the text box
274         and clicking on the "Disable SEED server" button.
275    <p>
276    
277  <li>You now need to capture the assignments, annotations and  <li>You now need to capture the assignments, annotations and
278  subsystems work that has been done on the production machine.  subsystems work that has been done on the production machine.
279  To do this, you need to know when the last production release  To do this, you need to know when the last production release
280  was installed.  Suppose that it was July 1, 2004.  was installed.  Suppose that it was July 1, 2004.
281  If that was the date, we recommend that you run<br><br>       If that was the date, we recommend that you run
282  <pre>  <pre>
283      <b>extract_data_for_syncing_after_update 7/1/2004 /tmp/sync.data.july.1.2004<</b>          <b>extract_data_for_syncing_after_update 7/1/2004 /tmp/sync.data.july.1.2004</b>
284  </pre>  </pre>
285  <br><br>  
286  This will capture your updates and save them in the directory  This will capture your updates and save them in the directory
287  /tmp/sync.data.july.1.2004.       /tmp/sync.data.july.1.2004.<br>
288  <li>Now, you need to replace your <b>Data</b> directory (within  <p>
289  <b>FIGdisk/FIG</b>) with the new version from the update system.  We  
290  suggest that you do the following:  <li>Now, you need to stop the existing production system using
291  <ol>  <pre>
292  <li>archive the existing <b>Data</b> directory.  These can usually be          ~/FIGdisk/bin/stop-servers
293  discarded within a month or two, but keeping them around is a good  </pre>
294  safety measure.  <p>
295  <li>move a copy of the update <b>Data</b> directory into the  
296  <b>FIGdisk/FIG</b> directory.  <li>Now, you need to configure the runtime environment for the system
297  </ol>  you are running on.
298  At this point, you have a version of the data from the update system  To do this, use
299  in the right location, but the internal databases all contain the old data.  <pre>
300  <li> Now, run          cd TargetDirectory
301            ./configure MacOrLinux
302    </pre>
303    where <b>MacOrLinux</b> must be a currently supported environment.
304    Those that are supported on July 24, 2004 are <b>mac</b> for
305    Macintoshes running panther, <b>mac-jaguar</b> for those that have not
306    upgraded to panther, and <b>linux-postgres</b>.
307    <p>
308    
309    <li>Now, you need to insert the new Data directory into the newly
310    constructed version of the SEED.  To do this use
311    <pre>
312            chmod -R 777 TheNewData
313            cd TargetDirectory/FIG
314            ln -s TheNewData Data
315    </pre>
316    where TheNewData is the new Data directory, which normally comes  from the
317    update system.  If you acquired a new Data directory via Data DVDs, you
318    will need to unpack them using the README instructions, but what
319    results is a new version of the <b>Data</b> directory.
320    <p>
321    
322    <li>Now, you need to start the servers in order to load the databases
323    with the new release using
324  <pre>  <pre>
325          <b>fig load_all</b>          cd TargetDirectory/bin
326            ./start-servers
327            cd ..
328            source config/fig-user-env.sh
329            init_FIG
330            fig load_all
331  </pre>  </pre>
332  to reload the production databases with the data from the newly inserted Data directory.  This last command will run for several hours.
333  This will usually take several hours.  <p>
334    
335    (<b>WARNING:</b> Please note that, because the new SEED's databases
336    do not yet exist, the `init_FIG` command will generate two totally
337    harmless but rather terrifying error messages the very first time it is executed,
338    so that its output will look something like this:
339    
340    <pre>
341    DBI connect('dbname=fig;port=10000','fig',...) failed: FATAL:  Database "fig" does not exist in the system catalog. at /home2/FIGdisk.July22/dist/releases/snap-2004-0723/linux-postgres/lib/FigKernelPackages/DBrtns.pm line 21
342    
343    Initializing new SEED database fig
344    
345    ERROR:  DROP DATABASE: database "fig" does not exist
346    dropdb: database removal failed
347    CREATE DATABASE
348    NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index 'file_table_pkey' for table 'file_table'
349    CREATE TABLE
350    
351    Complete. You will need to run "fig load_all" to load the data.
352    </pre>
353    We recognize that generating the above two faux "FATAL" errors
354    constitutes a rather ugly and inelegant implementation,
355    but we have not yet found a more elegant database initialization method
356    that can avoid generating them.)
357    <p>
358    
359  <li>Now, you need to capture the changes made to the old production  <li>Now, you need to capture the changes made to the old production
360  version using something like  version using something like
 <br>  
361  <pre>  <pre>
362          <b>sync_new_system /tmp/sync.data.july.1.2004 make-assignments</b>          <b>sync_new_system /tmp/sync.data.july.1.2004 make-assignments</b>
363  </pre>  </pre>
364  <br>  <p>
365  <li> make the production machine available for use.  
366    <li>Run
367        <pre>
368            index_annotations
369            index_subsystems
370            make_indexes
371        </pre>
372    <p>
373    
374    <li> Now, finally, you should alter the symbolic link in <i>~fig</i> to
375    the current FIGdisk using something like:
376    <pre>
377            cd ~fig
378            rm FIGdisk     # should be removing a symbolic link to the current SEED
379            ln -s TargetDirectory FIGdisk
380    </pre>
381    That should make the new SEED the one available through the Web interface.
382    <p>
383    
384  <li>You should now bring your update system to the same state as the  <li>You should now bring your update system to the same state as the
385  production system.  This can be done by making sure that  production system.  This can be done by making sure that
386  <b>/tmp/sync.data.july.1.2004</b> is accessible to the update system.  <b>/tmp/sync.data.july.1.2004</b> is accessible to the update system.
# Line 222  Line 394 
394  <br>  <br>
395  on the update machine.  on the update machine.
396  </ol>  </ol>
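<p>The final link-switching step, and the "flip back" recovery it allows, can be sketched as a shell function. This is only an illustration: the name <code>switch_figdisk</code> and the <code>.previous</code> file that records the old link target are assumptions of the sketch, not SEED tools.</p>

```shell
# Hypothetical helper (not part of the SEED): repoint the FIGdisk symbolic
# link at a new installation, remembering the old target for recovery.
switch_figdisk() {
    link="$1"     # e.g. ~fig/FIGdisk
    target="$2"   # e.g. the TargetDirectory built in the steps above

    # Never remove a real directory by mistake: only a symlink may be replaced.
    if [ -e "$link" ] && [ ! -L "$link" ]; then
        echo "$link is not a symbolic link; refusing to remove it" >&2
        return 1
    fi
    if [ ! -d "$target" ]; then
        echo "target is not a directory: $target" >&2
        return 1
    fi

    if [ -L "$link" ]; then
        # Record where the link pointed so that the switch can be undone.
        readlink "$link" > "$link.previous" || return 1
        rm "$link" || return 1
    fi
    ln -s "$target" "$link"
}
```

<p>If something goes wrong after the switch, calling the function again with the path stored in the <code>.previous</code> file points the link back at the old system.</p>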
397    <p>
398    
399  Our experience is that anytime a group wishes to share a common production environment,  Our experience is that anytime a group wishes to share a common production environment,
400  this 2-system approach is the way to do it.  You can, if necessary,  this 2-system approach is the way to do it.  You can, if necessary,
401  put both systems on the same physical machine.  This does require some  put both systems on the same physical machine.  This does require some
# Line 233  Line 407 
407  desirable to spend a little more and get at least 1 gigabyte of main  desirable to spend a little more and get at least 1 gigabyte of main
408  memory and 200 gigabytes of external disk.  memory and 200 gigabytes of external disk.
409  <br>  <br>
410  <h2>Adding a New Genome to an Existing SEED</h2>  <h2 id="adding_genomes">Adding a New Genome to an Existing SEED</h2>
411  To add a new genome to a running SEED is fairly easy, but there are a  To add a new genome to a running SEED is fairly easy, but there are a
412  number of details that do have to be handled with care.  number of details that do have to be handled with care.
413  <p>  <p>
# Line 485  Line 659 
659  The <i>add_genome</i> request will add your new genome and queue a computational request that similarities  The <i>add_genome</i> request will add your new genome and queue a computational request that similarities
660  be computed for the protein-encoding genes.  be computed for the protein-encoding genes.
661    
662  <h2>Computing Similarities</h2>  <h2 id="sims">Computing Similarities</h2>
663    
664  Adding a genome does not automatically get similarities computed for the new genome; it queues the request.  Adding a genome does not automatically get similarities computed for the new genome; it queues the request.
665  To get the similarities actually computed, you need to establish a computational environment on which  To get the similarities actually computed, you need to establish a computational environment on which
# Line 535  Line 709 
709  We anticipate that at least one major center (Argonne National Lab) and, perhaps, more will create well-defined  We anticipate that at least one major center (Argonne National Lab) and, perhaps, more will create well-defined
710  interfaces for handling high-volume requests.  At FIG, we will maintain a set of instructions on how to set up  interfaces for handling high-volume requests.  At FIG, we will maintain a set of instructions on how to set up
711  your configuration to exploit these resources.  your configuration to exploit these resources.
712    <p>
713    No matter how you produce the new similarities, they need to be added
714    as a file in the <b>FIGdisk/FIG/Data/NewSims</b> directory.  Then, you
715    need to index these similarities using
716    <pre>
717            index_sims ~/FIGdisk/FIG/Data/NewSims/XXXX
718    </pre>
719    where XXXX is the file you added.  If you have more than one such
720    file, just put in several arguments for the command.  This will
721    "index" the similarities in that any of the new PEGs which have
722    similarities connecting them to other PEGs from the existing genomes
723    can now be displayed.  However, the connection from the existing
724    genomes to the new PEGs does not yet exist (we call these the "flips"
725    of the computed sims).  To get this ability, you need to go through a
726    process that will make your system unavailable for a period (and, it
727    will produce a substantial load on your system for a day or so, while
728    the SEED sorts, sifts, inserts, and generally plays with the "flips").
729    <br>
730    The extra steps you need to take to make a fully functional version
731    are as follows:
732    <ol>
733    <li>
734    First, you need to run
735    <pre>
736            update_sims ~/FIGdisk/FIG/Data/Global/peg.synonyms 300 ~/FIGdisk/FIG/Data/Sims ~/Tmp/FlippedSims ~/FIGdisk/FIG/Data/NewSims/*
737    </pre>
738    This should produce updated similarity files in a VERY BIG directory
739    that we happened to put at <i>~/Tmp/FlippedSims</i> (but, which you could
740    put anywhere).  This may run as much as a day or so (and you can watch
741    its progress as it updates the similarity files).
742    <li>The next step is to replace the existing similarity files with the
743    newly computed ones.  You need to make the SEED unavailable (via the
744    <b>SEED Control Panel</b>).
745    <li>Then, blow away the existing similarities using something like
746    <pre>
747            rm ~/FIGdisk/FIG/Data/Sims/*
748            rm ~/FIGdisk/FIG/Data/NewSims/*
749            cp ~/Tmp/FlippedSims/* ~/FIGdisk/FIG/Data/Sims
750            rm -r ~/Tmp/FlippedSims
751    </pre>
752    There are several ways to do this.  You might want to save the old
753    similarities somewhere.  You might be able to move (rather than copy),
754    the similarities.  Whatever suits you.
755    <li> Then run
756    <pre>
757            index_sims
758    </pre>
759    to re-index all of the similarities, and you should be fully
760    operational.
761    </ol>
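<p>For those who would rather save the old similarities than delete them, the replacement step might be sketched as below. The function name <code>swap_in_flipped_sims</code> and the idea of a separate directory for the old files are assumptions of this sketch. It moves files rather than copying them, and it leaves cleaning out <b>NewSims</b> and running <i>index_sims</i> to you.</p>

```shell
# Hypothetical helper (not part of the SEED): park the old similarity files
# in a holding directory, then move the newly computed ones into place.
swap_in_flipped_sims() {
    sims_dir="$1"      # e.g. ~/FIGdisk/FIG/Data/Sims
    flipped_dir="$2"   # e.g. ~/Tmp/FlippedSims
    saved_dir="$3"     # where the old similarity files are kept

    if [ ! -d "$sims_dir" ] || [ ! -d "$flipped_dir" ]; then
        echo "both $sims_dir and $flipped_dir must exist" >&2
        return 1
    fi
    # Do nothing if update_sims produced no output; the old files stay live.
    if [ -z "$(ls -A "$flipped_dir")" ]; then
        echo "$flipped_dir is empty; leaving $sims_dir untouched" >&2
        return 1
    fi

    mkdir -p "$saved_dir" || return 1
    for f in "$sims_dir"/*; do
        if [ -e "$f" ]; then mv "$f" "$saved_dir"/ || return 1; fi
    done
    for f in "$flipped_dir"/*; do
        if [ -e "$f" ]; then mv "$f" "$sims_dir"/ || return 1; fi
    done
}
```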
762    <br>
763    
764  <h2>Deleting Genomes from a Version of the SEED </h2>  <h2 id="deleting_genomes">Deleting Genomes from a Version of the SEED</h2>
765    
766  There are two common instances in which one wishes to delete genomes from a running version of the SEED: one is  There are two common instances in which one wishes to delete genomes from a running version of the SEED: one is
767  when you wish to replace an existing version of a genome (in which case the replacement is viewed as first  when you wish to replace an existing version of a genome (in which case the replacement is viewed as first
# Line 564  Line 789 
789  the path to the current Data directory.  The third argument specifies the name of a directory  the path to the current Data directory.  The third argument specifies the name of a directory
790  that is created holding the extraction.  Thus,  that is created holding the extraction.  Thus,
791  <pre>  <pre>
792          extract_genomes unrestricted /Users/fig/FIGdisk/FIG/Data /Volumes/Tmp/ExtractedData          extract_genomes unrestricted ~/FIGdisk/FIG/Data /Volumes/Tmp/ExtractedData
793  </pre>  </pre>
794  would create the extracted Data directory for you.  If you wish to then produce a fully distributable  would create the extracted Data directory for you.  If you wish to then produce a fully distributable
795  version of the SEED from the existing version and the extracted Data directory, you would  version of the SEED from the existing version and the extracted Data directory, you would
796  use  use
797  <pre>  <pre>
798          make_a_SEED /Users/fig/FIGdisk /Volumes/Tmp/ExtractedData /Volumes/MyFriend/FIGdisk.ReadyToGo          make_a_SEED ~/FIGdisk /Volumes/Tmp/ExtractedData /Volumes/MyFriend/FIGdisk.ReadyToGo
799          rm -rf /Volumes/Tmp/ExtractedData          rm -rf /Volumes/Tmp/ExtractedData
800  </pre>  </pre>
801    
802  <h2>Periodic Reintegration of Similarities</h2>   <h2 id="reintegrate_sims">Periodic Reintegration of Similarities</h2>
803    
804  When the initial SEED was constructed, similarities were computed.  For most similarities of the form  When the initial SEED was constructed, similarities were computed.  For most similarities of the form
805  "Id1 and Id2 are similar", entries were "recorded" for both Id1 and Id2.  This is not always true,  "Id1 and Id2 are similar", entries were "recorded" for both Id1 and Id2.  This is not always true,
# Line 592  Line 817 
817  </pre>  </pre>
818  The job will probably run for quite a while (perhaps as much as a day or two).  The job will probably run for quite a while (perhaps as much as a day or two).
819    
820  <h2>Computing "Pins" and "Clusters"</h2>  <h2 id="pins_and_clusters">Computing "Pins" and "Clusters"</h2>
821    
822  The SEED displays potentially significant clusters on prokaryotic chromosomes.  In the  The SEED displays potentially significant clusters on prokaryotic chromosomes.  In the
823  process of finding preserved contiguity, it computes "pins", which are simply a set of genes  process of finding preserved contiguity, it computes "pins", which are simply a set of genes
# Line 607  Line 832 
832          compute_pins_and_clusters 562.4          compute_pins_and_clusters 562.4
833  </pre>  </pre>
834  would compute and add entries for all of the <i>pegs</i> in genome 562.4.  would compute and add entries for all of the <i>pegs</i> in genome 562.4.
835    
836    <h2 id="auto_annotation">
837       Automatic Annotation of Genomes
838    </h2>
839    The SEED provides a simple but limited capability for automated assignment
840    of protein-encoding gene function based on similarity.
841    Candidate functions are assigned scores based on the combined strengths
842    of all BLASTP similarities to genes carrying that particular assignment,
843    weighted by the provenance and assignment-confidence for each similar gene.
844    The final automated function assignment is then determined from the
845    list of candidate functions and their associated scores.
846    
847    Automated assignment is a four-step process:
848    <ol>
849    <li> Create a list of PEGs to be automatically assigned.
850    If one wishes to make assignments to an entire organism or set of organisms
851    that are already installed in the SEED, the simplest method for creating
852    this list is to type the following command:
853    <pre>
854        pegs Genome1 Genome2 Genome3 ... > ~/Tmp/peg.list
855    </pre>
856    
857    <p>
858    <li> Next, create a list of candidate function-assignments using the following
859    command:
860    <pre>
861       auto_assign < ~/Tmp/peg.list > ~/Tmp/candidate.funcs
862    </pre>
863    (NOTE: The `auto_assign` command has some additional optional parameters;
864    for example, if one knows that all the PEGs in 'peg.list' are from
865    prokaryotic organisms, one can make use of this additional information
866    by invoking `auto_assign` as follows:
867    <pre>
868       auto_assign prokaryote < ~/Tmp/peg.list > ~/Tmp/candidate.funcs
869    </pre>
870    Also, if one wishes to use an alternate file of similarity data named 'simfile'
871    instead of the precomputed similarities stored in the SEED, one can instead type:
872    <pre>
873       auto_assign sims=simfile < ~/Tmp/peg.list > ~/Tmp/candidate.funcs
874    </pre>
875    Finally, `auto_assign` can read a set of alternate parameters from a file,
876    but we recommend that you stick with the default settings, and not exploit this
877    last feature unless you are a qualified SEED wizard.)
878    <p>
879    
880    <li> Next, create a SEED format assigned-functions file as follows:
881    <pre>
882        make_calls < ~/Tmp/candidate.funcs > ~/Tmp/assigned_functions
883    </pre>
884    Alternately, if you wish to suppress the class of "non-informative" function assignments
885    such as "Hypothetical protein," "Unclassified protein," "predicted gene," etc.,
886    you may do so using the '-no_hypos' flag:
887    <pre>
888        make_calls -no_hypos < ~/Tmp/candidate.funcs > ~/Tmp/assigned_functions
889    </pre>
890    
891    <li> Finally, install the automated assignments in the SEED using the command
892    <pre>
893        fig auto_assignF ~/Tmp/assigned_functions
894    </pre>
895    
896    </ol>
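<p>The four steps above can be chained in a small wrapper so that a failure at any stage stops the run before anything is installed. This is a sketch under stated assumptions: the function name <code>auto_annotate</code> and the working-directory argument are hypothetical, while <i>pegs</i>, <i>auto_assign</i>, <i>make_calls</i> and <i>fig auto_assignF</i> are invoked exactly as shown in the steps, with default settings.</p>

```shell
# Hypothetical wrapper (not part of the SEED): run the four automated
# assignment steps in order, stopping at the first failure.
auto_annotate() {
    workdir="$1"   # e.g. ~/Tmp; the intermediate files are written here
    shift          # the remaining arguments are genome IDs, e.g. 562.4

    if [ "$#" -eq 0 ]; then
        echo "usage: auto_annotate WORKDIR GENOME..." >&2
        return 1
    fi
    mkdir -p "$workdir" || return 1

    # Step 1: list the PEGs of the requested genomes.
    pegs "$@" > "$workdir/peg.list" || return 1
    # Step 2: compute candidate function assignments.
    auto_assign < "$workdir/peg.list" > "$workdir/candidate.funcs" || return 1
    # Step 3: turn the candidates into an assigned-functions file.
    make_calls < "$workdir/candidate.funcs" > "$workdir/assigned_functions" || return 1
    # Step 4: install the assignments in the SEED.
    fig auto_assignF "$workdir/assigned_functions"
}
```

<p>For example, <code>auto_annotate ~/Tmp 562.4</code> would process the single genome 562.4. The intermediate files are kept, so the candidate functions can be inspected before or after installation.</p>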
897    
898    It should be once again noted that the SEED's automated assignment algorithm
899    is quite simple and crude, being only slightly better than simply assigning
900    the function of the highest-scoring BLASTP hit; however, it at least provides
901    a "quick and dirty" starting point for making an initial assessment of a genome,
902    which may then be cleaned up and refined by skilled genome annotators.
903    
904    
905    
906    
907    
908    
