
Diff of /FigTutorial/SEED_administration_issues.html


revision 1.9, Fri Jul 30 19:21:08 2004 UTC revision 1.18, Wed Jul 19 17:36:16 2006 UTC
# Line 23  Line 23 
23  Adding a New Genome to an Existing SEED  Adding a New Genome to an Existing SEED
24  </A>  </A>
25    
26    <li><A HREF="#importing_external">
27    Importing External Protein Data
28    </A>
29    
30  <li><A HREF="#sims">  <li><A HREF="#sims">
31      Computing Similarities      Computing Similarities
32  </A>  </A>
# Line 39  Line 43 
43      Computing "Pins" and "Clusters"      Computing "Pins" and "Clusters"
44  </A>  </A>
45    
46    <li><A HREF="#auto_annotation">
47        Automatic Annotation of Genomes
48    </A>
49    
50  </ul>  </ul>
51    
52    
# Line 90  Line 98 
98  /Volumes/Backup is a backup disk.  Then,  /Volumes/Backup is a backup disk.  Then,
99  <br>  <br>
100  <pre>  <pre>
101          cp -pRP /Users/fig/FIGdisk/FIG/Data /Volumes/Backup/Data.Backup          cp -pRP ~/FIGdisk/FIG/Data /Volumes/Backup/Data.Backup
102          gzip -r /Volumes/Backup/Data.Backup          gzip -r /Volumes/Backup/Data.Backup
103  </pre>  </pre>
104  <br>  <br>
# Line 236  Line 244 
244    
245  <li> First, make sure that you are in the BASH shell by typing "echo $SHELL";  <li> First, make sure that you are in the BASH shell by typing "echo $SHELL";
246     if the result is not "bash", type "bash" to enter the BASH shell.     if the result is not "bash", type "bash" to enter the BASH shell.
247    <p>
248    
249  <li> Next, check that the result of typing "which perl" is the version  <li> Next, check that the result of typing "which perl" is the version
250     of perl owned by the SEED; it should look something like     of perl owned by the SEED; it should look something like
# Line 249  Line 258 
258         source Path_to_FIGdisk/config/fig-user-env.sh         source Path_to_FIGdisk/config/fig-user-env.sh
259     </pre>     </pre>
260     to set up your FIG environment properly.     to set up your FIG environment properly.
261    <p>
262    
263  <li> Next, make a copy of the Code Distribution Environment (from a DVD  <li> Next, make a copy of the Code Distribution Environment (from a DVD
264  or via the network).  Suppose that we have made such a directory in  or via the network).  Suppose that we have made such a directory in
# Line 260  Line 270 
270  where <b>TargetDirectory</b> is where you wish to build the new  where <b>TargetDirectory</b> is where you wish to build the new
271  production version.  We recommend calling it something like  production version.  We recommend calling it something like
272  <b>FIGdisk.July24</b>.  <b>FIGdisk.July24</b>.
273    <p>
274    
275  <li> Stop all work on the production machine for the duration of the update.  <li> Stop all work on the production machine for the duration of the update.
276       You do this by clicking on the "Seed Control Panel" link,       You do this by clicking on the "Seed Control Panel" link,
277       and then entering an explanatory message in the text box       and then entering an explanatory message in the text box
278       and clicking on the "Disable SEED server" button.       and clicking on the "Disable SEED server" button.
279    <p>
280    
281  <li> You now need to capture the assignments, annotations and  <li> You now need to capture the assignments, annotations and
282       subsystems work that has been done on the production machine.       subsystems work that has been done on the production machine.
# Line 277  Line 289 
289    
290       This will capture your updates and save them in the directory       This will capture your updates and save them in the directory
291       /tmp/sync.data.july.1.2004.<br>       /tmp/sync.data.july.1.2004.<br>
292    <p>
293    
294  <li>Now, you need to stop the existing production system using  <li>Now, you need to stop the existing production system using
295  <pre>  <pre>
296          ~/FIGdisk/bin/stop-servers          ~/FIGdisk/bin/stop-servers
297  </pre>  </pre>
298    <p>
299    
300  <li>Now, you need to configure the runtime environment for the system  <li>Now, you need to configure the runtime environment for the system
301  you are running on.  you are running on.
# Line 294  Line 308 
308  Those that are supported on July 24, 2004 are <b>mac</b> for  Those that are supported on July 24, 2004 are <b>mac</b> for
309  Macintoshes running panther, <b>mac-jaguar</b> for those that have not  Macintoshes running panther, <b>mac-jaguar</b> for those that have not
310  upgraded to panther, and <b>linux-postgres</b>.  upgraded to panther, and <b>linux-postgres</b>.
311    <p>
312    
313  <li>Now, you need to insert the new Data directory into the newly  <li>Now, you need to insert the new Data directory into the newly
314  constructed version of the SEED.  To do this use  constructed version of the SEED.  To do this use
# Line 306  Line 321 
321  update system.  If you acquired a new Data directory via Data DVDs, you  update system.  If you acquired a new Data directory via Data DVDs, you
322  will need to unpack them using the README instructions, but what  will need to unpack them using the README instructions, but what
323  results is a new version of the <b>Data</b> directory.  results is a new version of the <b>Data</b> directory.
324    <p>
325    
326  <li>Now, you need to start the servers in order to load the databases  <li>Now, you need to start the servers in order to load the databases
327  with the new release using  with the new release using
# Line 318  Line 334 
334          fig load_all          fig load_all
335  </pre>  </pre>
336  This last command will run for several hours.  This last command will run for several hours.
337    <p>
338    
339    (<b>WARNING:</b> Please note that, because the new SEED's databases
340    do not yet exist, the `init_FIG` command will generate two totally
341    harmless but rather terrifying error messages the very first time it is executed,
342    so that its output will look something like this:
343    
344    <pre>
345    DBI connect('dbname=fig;port=10000','fig',...) failed: FATAL:  Database "fig" does not exist in the system catalog. at /home2/FIGdisk.July22/dist/releases/snap-2004-0723/linux-postgres/lib/FigKernelPackages/DBrtns.pm line 21
346    
347    Initializing new SEED database fig
348    
349    ERROR:  DROP DATABASE: database "fig" does not exist
350    dropdb: database removal failed
351    CREATE DATABASE
352    NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index 'file_table_pkey' for table 'file_table'
353    CREATE TABLE
354    
355    Complete. You will need to run "fig load_all" to load the data.
356    </pre>
357    We recognize that generating the above two faux "FATAL" errors
358    constitutes a rather ugly and inelegant implementation,
359    but we have not yet found a more elegant database initialization method
360    that can avoid generating them.)
361    <p>
362    
363  <li> Now, you need to capture the changes made to the old production  <li> Now, you need to capture the changes made to the old production
364       version using something like       version using something like
365       <pre>       <pre>
366           <b>sync_new_system /tmp/sync.data.july.1.2004 make-assignments</b>           <b>sync_new_system /tmp/sync.data.july.1.2004 make-assignments</b>
367       </pre>       </pre>
368    <p>
369    
370  <li>Run  <li>Run
371  <pre>  <pre>
372          index_annotations          index_annotations
373          index_subsystems          index_subsystems
374          make_indexes          make_indexes
375  </pre>  </pre>
376    <p>
377    
378  <li> Now, finally, you should alter the symbolic link in <i>~fig</i> to  <li> Now, finally, you should alter the symbolic link in <i>~fig</i> to
379  the current FIGdisk using something like:  the current FIGdisk using something like:
# Line 339  Line 383 
383          ln -s TargetDirectory FIGdisk          ln -s TargetDirectory FIGdisk
384  </pre>  </pre>
385  That should make the new SEED the one available through the Web interface.  That should make the new SEED the one available through the Web interface.
386    <p>
387    
388  <li> You should now bring your update system to the same state as the  <li> You should now bring your update system to the same state as the
389       production system.  This can be done by making sure that       production system.  This can be done by making sure that
# Line 618  Line 663 
663  The <i>add_genome</i> request will add your new genome and queue a computational request that similarities  The <i>add_genome</i> request will add your new genome and queue a computational request that similarities
664  be computed for the protein-encoding genes.  be computed for the protein-encoding genes.
665    
666    <h2 id="importing_external">Importing External Protein Data</h2>
667    
668    The presence of external judgements about the possible functions of encoded proteins
669    is one of the essential aspects of the SEED.  It becomes important that one be able to
670    add new sources of annotation, and to periodically update the judgements of
671    existing sources.  To update the external sets of proteins and annotations, build a new nonredundant
672    database of proteins, and compute the associated similarities, one should proceed as follows:
673    
674    <ol>
675    <li> Stop using the system until this procedure completes.
676    <br><br>
677    <li> Update the NR Directory
678    <br><br>
679    The <b>NR</b> directory is located within the <b>Data</b> directory:
680    <br>
681    <pre>
682            ~fig                                      on a Mac: /Users/fig; on Linux: /home/fig
683                    FIGdisk
684                            dist                      source code
685                            FIG
686                                    Tmp               temporary files
687                                    Data              data in readable form
688                                              NR      Contains external Data
689    
690    </pre>
691    
692    The <b>NR</b> directory contains one subdirectory for each source of external
693    assignments (the released SEED includes subdirectories for SwissProt, NCBI, UniProt, and KEGG).
694    You may add more subdirectories.
695    <p>
696    Each subdirectory must include 3 files:
697    <ol>
698    <li> <b>fasta</b> should be a fasta file containing the protein sequences.  These sequences will
699    be used to establish a correspondence between these IDs and other protein sequences within the SEED.
700    <br><br>
701    <li> <b>org.table</b> is a two-column, tab-separated table.  Column 1 is the ID, and column 2 is the
702    organism corresponding to the ID.
703    <br><br>
704    <li> <b>assign_functions</b> is a 2-column table.  The ID is in column 1, and column 2 contains the
705    gene function (often called a <i>product name</i>) asserted by the external source.
706    </ol>
707    <br>
708    You should proceed only when you have updated as many of the sources as you wish.
709    <br><br>
710    <li> Now run
711    <pre>
712           import_external_sequences_step1
713    </pre>
714    
715    This program will build a new nonredundant database, check to see what has changed, and will
716    build the input required to compute new similarities.
717    <br><br>
718    <li> Compute the needed similarities
719    
720    You will need three files to compute a new batch of similarities.  The locations of these
721    three files are displayed by <b>import_external_sequences_step1</b> just before completion
722    (i.e., you should have gotten them as the output of the last step).  Compute the similarities (see
723    the discussion below) and store them in the <b>NewSims</b> directory (again the precise location
724    was displayed by <b>import_external_sequences_step1</b>).
725    <br><br>
726    <li> Run
727    <pre>
728           import_external_sequences_step3
729    </pre>
730    </ol>
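
The layout required of each NR source subdirectory can be checked before running import_external_sequences_step1. The following is a minimal Python sketch, not part of the SEED distribution: the function name check_nr_source is hypothetical, and it assumes that assign_functions is tab-separated like org.table (the text above only says it is a 2-column table).

```python
import os

# The three files each NR source subdirectory must contain, per the text above.
REQUIRED_FILES = ("fasta", "org.table", "assign_functions")


def check_nr_source(source_dir):
    """Return a list of problems found in one NR source subdirectory.

    Hypothetical helper, not a SEED command. Assumes both tables are
    tab-separated with an ID in column 1.
    """
    problems = []
    for name in REQUIRED_FILES:
        if not os.path.isfile(os.path.join(source_dir, name)):
            problems.append("missing file: " + name)
    for name in ("org.table", "assign_functions"):
        path = os.path.join(source_dir, name)
        if not os.path.isfile(path):
            continue  # already reported as missing
        with open(path) as handle:
            for lineno, line in enumerate(handle, start=1):
                if not line.strip():
                    continue  # ignore blank lines
                fields = line.rstrip("\n").split("\t")
                if len(fields) != 2 or not fields[0]:
                    problems.append(
                        "%s line %d: expected 2 tab-separated columns" % (name, lineno)
                    )
    return problems
```

An empty result means the subdirectory has the expected shape; it does not check that the IDs in the three files agree with one another.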
731    
732  <h2 id="sims">Computing Similarities</h2>  <h2 id="sims">Computing Similarities</h2>
733    
734  Adding a genome does not automatically get similarities computed for the new genome; it queues the request.  Adding a genome does not automatically get similarities computed for the new genome.
735  To get the similarities actually computed, you need to establish a computational environment on which  To get the similarities actually computed, you need to compute them and make them available in
736  the blast runs will be made, and then initiate a request on the machine running the SEED.  the <b>FIGdisk/FIG/Data/NewSims</b> directory.
737  <p>  <p>
738  This is not a completely trivial process because there are a variety of different ways to compute  To compute similarities, you will need to do the following:
 similarities:  
739  <ol>  <ol>
740  <li> You can just compute them on the system running the SEED.  This can take several days, but this  <li>The translations of the set of PEGs in your new genome (i.e., genome 562.4) should be in
741  is often a perfectly reasonable way to get the job done.  <b>~fig/FIGdisk/FIG/Data/Organisms/562.4/Features/peg/fasta</b>.  A copy of this was appended to
742  <li>Alternatively, you may be in an environment where you have a set of networked machines (say, 4-5 machines),  <b>~fig/FIGdisk/FIG/Data/Global/nr</b> when your genome was added.  <b>nr</b> is the "nonredundant database"
743  and you wish to just exploit these machines to do the blast runs.  we use to compute similarities (and the one you must use).  To get the initial blast results, you would use something
744  <li> Finally, you may be dealing with a large genome or genomes (and, hence, the need for many days of computation).  like
745  In this case, it makes sense to utilize a large computational resource, and this resource may either  <br>
746  be a local cluster or a service provided over the net.  <pre>
747              blastall -i ~fig/FIGdisk/FIG/Data/Organisms/562.4/Features/peg/fasta -d ~fig/FIGdisk/FIG/Data/Global/nr -m 8 -FF -p blastp | reduce_sims ~fig/FIGdisk/FIG/Data/Global/peg.synonyms 300 > reduced.sims
748    </pre>
749    <br>
750    which produces the blast results in a tab-separated format.  The invocation of <b>reduce_sims</b> is optional.
751    It has the effect of limiting the retained similarities for each PEG to 300, with a truncation approach that attempts to preserve at least one similarity against each other genome (i.e., the trimming is selective).
752    <li>
753    The output of blastall lacks 2 columns that we need -- columns containing the length of each of the similar sequences.  To add that, you would use
754    <br>
755    <pre>
756            reformat_sims ~fig/FIGdisk/FIG/Data/Global/nr < reduced.sims > ~fig/FIGdisk/FIG/Data/NewSims/sims.for.562.4
757    </pre>
758    <br>
759    This will actually append two columns to each similarity and place the results in the <b>NewSims</b>
760    directory where it should be.
761  </ol>  </ol>
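
To illustrate what "appending two length columns" means, here is a minimal Python sketch. It is not the implementation of reformat_sims: the function names are hypothetical, and it assumes the two new columns are the lengths of the first and second sequence, in that order, taken from a FASTA file such as nr.

```python
def fasta_lengths(lines):
    """Map each sequence ID (first word after '>') to its sequence length."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def append_lengths(sim_lines, lengths):
    """Yield each tab-separated similarity with two length columns appended.

    Columns 1 and 2 of blastall -m 8 output are the two sequence IDs;
    the order of the appended lengths is an assumption of this sketch.
    """
    for line in sim_lines:
        fields = line.rstrip("\n").split("\t")
        id1, id2 = fields[0], fields[1]
        yield "\t".join(fields + [str(lengths[id1]), str(lengths[id2])])
```

Each 12-column line from blastall -m 8 becomes a 14-column line. An ID missing from the FASTA file raises a KeyError, which makes it obvious when the wrong database was supplied.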
762    <p>
763    The above description will produce similarities using a single invocation of
764    blastall.  For most large genomes, and whenever you wish to process a batch of genomes,
765    you should use parallel processing while maintaining the spirit of the approach.
766    No matter how you produce the new similarities, they need to be added
767    as a file in the <b>FIGdisk/FIG/Data/NewSims</b> directory.  Then, you
768    need to index these similarities using
769    <pre>
770            index_sims ~/FIGdisk/FIG/Data/NewSims/XXXX
771    </pre>
772    where XXXX is the file you added.  If you have more than one such
773    file, just put in several arguments for the command.  This will
774    "index" the similarities in that any of the new PEGs which have
775    similarities connecting them to other PEGs from the existing genomes
776    can now be displayed.  However, the connection from the existing
777    genomes to the new PEGs does not yet exist (we call these the "flips"
778    of the computed sims).  To get this ability, you need to go through a
779    process that will make your system unavailable for a period (and, it
780    will produce a substantial load on your system for a day or so, while
781    the SEED sorts, sifts, inserts, and generally plays with the "flips").
782  <br>  <br>
783  To establish the flexibility needed to support all of these alternatives, we implemented the following  The extra steps you need to take to make a fully functional version
784  approach:  are as follows:
785  <ul>  <ol>
786  <li>  <li>
787  The user can describe one or more <b>similarity computational environments</b>  First, you need to run
788  in a configuration file called <i>similarities.config</i>.  The details of this encoding  <pre>
789  are beyond the scope of this document.          update_sims ~/FIGdisk/FIG/Data/Global/peg.synonyms 300 ~/FIGdisk/FIG/Data/Sims ~/Tmp/FlippedSims ~/FIGdisk/FIG/Data/NewSims/*
790  These environments all represent potential ways to compute similarities.  </pre>
791  <br>  This should produce updated similarity files in a VERY BIG directory
792  <li>  that we happened to put at <i>~/Tmp/FlippedSims</i> (but, which you could
793  When a SEED systems administrator (usually, the normal SEED user) wishes to run similarities,  put anywhere).  This may run as much as a day or so (and you can watch
794  he runs a program specifying a specific similarity computational environment.  This causes all  its progress as it updates the similarity files).
795  the queued similarity requests to be batched up and sent off to the specified server (which may simply  <li>The next step is to replace the existing similarity files with the
796  be on the same machine).  He would use the <b>generate_similarities</b> command specifying two parameters: the  newly computed ones.  You need to make the SEED unavailable (via the
797  first specifies a similarities computational environment, and the second specifies whether or not automated assignments  <b>SEED Control Panel</b>).
798  should be computed as the similarity computations complete and the results are installed.  <li>Then, blow away the existing similarities using something like
799  As the similarities complete, they will automatically be installed.  Further, if a set of similarities arrive  <pre>
800  for a given protein-encoding gene, and if there is no current assignment of function for the gene,          rm ~/FIGdisk/FIG/Data/Sims/*
801  an automated assignment may be computed.  Whether or not such automated assignments are computed is determined          rm ~/FIGdisk/FIG/Data/NewSims/*
802  by the second parameter in the command used by the systems administrator to initiate the request.  For example,          cp ~/Tmp/FlippedSims/* ~/FIGdisk/FIG/Data/Sims
803  <pre>          rm -r ~/Tmp/FlippedSims
804          generate_similarities local auto-assignments  </pre>
805  </pre>  There are several ways to do this.  You might want to save the old
806  specifies a similarity computational environment labeled <i>local</i>, which presumably means "run the blast  similarities somewhere.  You might be able to move (rather than copy),
807  requests on this machine", and requests automated assignments for all protein-encoding genes that currently either  the similarities.  Whatever suits you.
808  have no assigned function or have an assigned function that is "hypothetical".  <li> Then run
809  </ul>  <pre>
810            index_sims
811    </pre>
812    to re-index all of the similarities, and you should be fully
813    operational.
814    </ol>
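
The "flips" described above are the same similarities seen from the other sequence's side. The following Python sketch shows the idea for a single line. It is an illustration only, not what update_sims does internally (update_sims also truncates and merges): the function name flip_sim is hypothetical, and the column layout is assumed to be the 12 blastall -m 8 columns followed by the two appended lengths.

```python
def flip_sim(line):
    """Return a similarity line with the roles of the two sequences exchanged."""
    f = line.rstrip("\n").split("\t")
    # Columns 0-1 hold the two IDs.
    f[0], f[1] = f[1], f[0]
    # Columns 6-7 are begin/end in sequence 1, columns 8-9 begin/end in sequence 2.
    f[6], f[7], f[8], f[9] = f[8], f[9], f[6], f[7]
    # Columns 12-13, if present, are assumed to be the two appended lengths.
    if len(f) >= 14:
        f[12], f[13] = f[13], f[12]
    return "\t".join(f)
```

Identity, alignment length, e-value and bit score are left unchanged because they describe the alignment rather than either sequence. Flipping a line twice returns the original line.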
815  <br>  <br>
816    
 We anticipate that at least one major center (Argonne National Lab) and, perhaps, more will create well-defined  
 interfaces for handling high-volume requests.  At FIG, we will maintain a set of instructions on how to set up  
 your configuration to exploit these resources.  
   
817  <h2 id="deleting_genomes">Deleting Genomes from a Version of the SEED</h2>  <h2 id="deleting_genomes">Deleting Genomes from a Version of the SEED</h2>
818    
819  There are two common instances in which one wishes to delete genomes from a running version of the SEED: one is  There are two common instances in which one wishes to delete genomes from a running version of the SEED: one is
# Line 678  Line 823 
823  <p>  <p>
824  To delete a set of genomes from a running version of the SEED, just use  To delete a set of genomes from a running version of the SEED, just use
825  <pre>  <pre>
826          fig delete_genomes G1 G2 ...Gn  (where G1 G2 ... Gn designates a list of genomes)          fig mark_deleted_genomes User G1 G2 ...Gn  (where G1 G2 ... Gn designates a list of genomes)
827  </pre>  </pre>
828  For example,  For example,
829  <pre>  <pre>
830          fig delete_genomes 562.1          fig mark_deleted_genomes RossO 562.1
831  </pre>  </pre>
832  could be used to delete a single genome with a genome ID of 562.1.  could be used to delete a single genome with a genome ID of 562.1.
 <p>  
 To make a copy with some genomes deleted to give to someone else requires a little different approach.  
 To extract a set of genomes from an existing version of the SEED, you need to run the command  
 <pre>  
         extract_genomes Which ExistingData ExtractedData  
 </pre>  
   
 The first argument is either the word "unrestricted" or the name of a file containing a list of  
 genome IDs (the genomes that are to be retained in the extraction).  The second argument is  
 the path to the current Data directory.  The third argument specifies the name of a directory  
 that is created holding the extraction.  Thus,  
 <pre>  
         extract_genomes unrestricted /Users/fig/FIGdisk/FIG/Data /Volumes/Tmp/ExtractedData  
 </pre>  
 would created the extracted Data directory for you.  If you wish to then produce a fully distributable  
 version of the SEED from the existing version and the extracted Data directory, you would  
 use  
 <pre>  
         make_a_SEED /Users/fig/FIGdisk /Volumes/Tmp/ExtractedData /Volumes/MyFriend/FIGdisk.ReadyToGo  
         rm -rf /Volumes/Tmp/ExtractedData  
 </pre>  
833    
834   <h2 id="reintegrate_sims">Periodic Reintegration of Similarities</h2>   <h2 id="reintegrate_sims">Periodic Reintegration of Similarities</h2>
835    
# Line 740  Line 864 
864          compute_pins_and_clusters 562.4          compute_pins_and_clusters 562.4
865  </pre>  </pre>
866  would compute and add entries for all of the <i>pegs</i> in genome 562.4.  would compute and add entries for all of the <i>pegs</i> in genome 562.4.
867    
868    <h2 id="auto_annotation">
869       Automatic Annotation of Genomes
870    </h2>
871    The SEED provides a simple but limited capability for automated assignment
872    of protein-encoding gene function based on similarity.
873    Candidate functions are assigned scores based on the combined strengths
874    of all BLASTP similarities to genes carrying that particular assignment,
875    weighted by the provenance and assignment-confidence for each similar gene.
876    The final automated function assignment is then determined from the
877    list of candidate functions and their associated scores.
878    
879    Automated assignment is a four-step process:
880    <ol>
881    <li> Create a list of PEGs to be automatically assigned.
882    If one wishes to make assignments to an entire organism or set of organisms
883    that are already installed in the SEED, the simplest method for creating
884    this list is to type the following command:
885    <pre>
886        pegs Genome1 Genome2 Genome3 ... > ~/Tmp/peg.list
887    </pre>
888    
889    <p>
890    <li> Next, create a list of candidate function-assignments using the following
891    command:
892    <pre>
893       auto_assign < ~/Tmp/peg.list > ~/Tmp/candidate.funcs
894    </pre>
895    (NOTE: The `auto_assign` command has some additional optional parameters;
896    for example, if one knows that all the PEGs in 'peg.list' are from
897    prokaryotic organisms, one can make use of this additional information
898    by invoking `auto_assign` as follows:
899    <pre>
900       auto_assign prokaryote < ~/Tmp/peg.list > ~/Tmp/candidate.funcs
901    </pre>
902    Also, if one wishes to use an alternate file of similarity data named 'simfile'
903    instead of the precomputed similarities stored in the SEED, one can instead type:
904    <pre>
905       auto_assign sims=simfile < ~/Tmp/peg.list > ~/Tmp/candidate.funcs
906    </pre>
907    Finally, `auto_assign` can read a set of alternate parameters from a file,
908    but we recommend that you stick with the default settings, and not exploit this
909    last feature unless you are a qualified SEED wizard.)
910    <p>
911    
912    <li> Next, create a SEED format assigned-functions file as follows:
913    <pre>
914        make_calls < ~/Tmp/candidate.funcs > ~/Tmp/assigned_functions
915    </pre>
916    Alternately, if you wish to suppress the class of "non-informative" function assignments
917    such as "Hypothetical protein," "Unclassified protein," "predicted gene," etc.,
918    you may do so using the '-no_hypos' flag:
919    <pre>
920        make_calls -no_hypos < ~/Tmp/candidate.funcs > ~/Tmp/assigned_functions
921    </pre>
922    
923    <li> Finally, install the automated assignments in the SEED using the command
924    <pre>
925        fig assign_functionF master:automated_assignments  ~/Tmp/assigned_functions
926    </pre>
927    
928    </ol>
929    
930    It should be once again noted that the SEED's automated assignment algorithm
931    is quite simple and crude, being only slightly better than simply assigning
932    the function of the highest-scoring BLASTP hit; however, it at least provides
933    a "quick and dirty" starting point for making an initial assessment of a genome,
934    which may then be cleaned up and refined by skilled genome annotators.
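
The scoring idea described at the start of this section can be sketched in a few lines of Python. This is a toy illustration, not the algorithm used by auto_assign: the function name, the use of the bit score as the "strength" of a similarity, and the single numeric weight standing in for provenance and assignment-confidence are all assumptions.

```python
from collections import defaultdict


def score_candidates(hits):
    """Rank candidate functions for one PEG.

    hits: iterable of (function, bit_score, weight) tuples, one per similar
    gene. The weight is a hypothetical stand-in for provenance and
    assignment-confidence. Returns (function, score) pairs, best first.
    """
    scores = defaultdict(float)
    for function, bit_score, weight in hits:
        scores[function] += bit_score * weight
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

With such a scheme, several moderately strong hits that agree on a function can outrank a single stronger hit carrying a low-confidence assignment, which is the main difference from simply copying the function of the top BLASTP hit.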
935    
936    
937    
938    
939    
940    

