
Diff of /FigTutorial/SEED_administration_issues.html


revision 1.11, Fri Jul 30 22:01:21 2004 UTC revision 1.16, Tue Jan 25 23:01:46 2005 UTC
# Line 39  Line 39 
39      Computing "Pins" and "Clusters"      Computing "Pins" and "Clusters"
40  </A>  </A>
41    
42    <li><A HREF="#auto_annotation">
43        Automatic Annotation of Genomes
44    </A>
45    
46  </ul>  </ul>
47    
48    
# Line 90  Line 94 
94  /Volumes/Backup is a backup disk.  Then,  /Volumes/Backup is a backup disk.  Then,
95  <br>  <br>
96  <pre>  <pre>
97          cp -pRP /Users/fig/FIGdisk/FIG/Data /Volumes/Backup/Data.Backup          cp -pRP ~/FIGdisk/FIG/Data /Volumes/Backup/Data.Backup
98          gzip -r /Volumes/Backup/Data.Backup          gzip -r /Volumes/Backup/Data.Backup
99  </pre>  </pre>
100  <br>  <br>
# Line 657  Line 661 
661    
662  <h2 id="sims">Computing Similarities</h2>  <h2 id="sims">Computing Similarities</h2>
663    
664  Adding a genome does not automatically get similarities computed for the new genome; it queues the request.  Adding a genome does not automatically get similarities computed for the new genome.
665  To get the similarities actually computed, you need to establish a computational environment on which  To get the similarities actually computed, you need to compute them and make them available in
666  the blast runs will be made, and then initiate a request on the machine running the SEED.  the <b>FIGdisk/FIG/Data/NewSims</b> directory.
667  <p>  <p>
668  This is not a completely trivial process because there are a variety of different ways to compute  To compute similarities, you will need to do the following:
 similarities:  
669  <ol>  <ol>
670  <li> You can just compute them on the system running the SEED.  This can take several days, but this  <li>The translations of the set of PEGs in your new genome (i.e., genome 562.4) should be in
671  is often a perfectly reasonable way to get the job done.  <b>~fig/FIGdisk/FIG/Data/Organisms/562.4/Features/peg/fasta</b>.  A copy of this was appended to
672  <li>Alternatively, you may be in an environment where you have a set of networked machines (say, 4-5 machines),  <b>~fig/FIGdisk/FIG/Data/Global/nr</b> when your genome was added.  <b>nr</b> is the "nonredundant database"
673  and you wish to just exploit these machines to do the blast runs.  we use to compute similarities (and the one you must use).  To get the initial blast results, you would use something
674  <li> Finally, you may be dealing with a large genome or genomes (and, hence, the need for many days of computation).  like
675  In this case, it makes sense to utilize a large computational resource, and this resource may either  <br>
676  be a local cluster or a service provided over the net.  <pre>
677              blastall -i ~fig/FIGdisk/FIG/Data/Organisms/562.4/Features/peg/fasta -d ~fig/FIGdisk/FIG/Data/Global/nr -m 8 -FF -p blastp | reduce_sims ~fig/FIGdisk/FIG/Data/Global/peg.synonyms 300 > reduced.sims
678    </pre>
679    <br>
680    which produces the blast results in a tab-separated format.  The invocation of <b>reduce_sims</b> is optional.
681    It has the effect of limiting the similarities retained for each PEG to 300, using a truncation approach that attempts to preserve at least one similarity against each other genome (i.e., the trimming is selective).
682    <li>
683    The output of blastall lacks 2 columns that we need -- columns containing the length of each of the similar sequences.  To add that, you would use
684    <br>
685    <pre>
686            reformat_sims ~fig/FIGdisk/FIG/Data/Global/nr < reduced.sims > ~fig/FIGdisk/FIG/Data/NewSims/sims.for.562.4
687    </pre>
688    <br>
689    This will actually append two columns to each similarity and place the results in the <b>NewSims</b>
690    directory where it should be.
691  </ol>  </ol>
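As a rough illustration of what step 2 accomplishes, the sketch below appends two length columns to a tab-separated similarity file using only awk. It is not the SEED's <b>reformat_sims</b>: the function name <i>append_lengths</i> is hypothetical, and it assumes that the IDs in columns 1 and 2 match the first word of each FASTA header line.

```shell
# append_lengths FASTA SIMS
# Hypothetical helper (not part of the SEED): prints each line of SIMS with
# two extra columns, the sequence lengths of the IDs found in columns 1 and 2.
append_lengths() {
    awk -F'\t' -v OFS='\t' '
        NR == FNR {                      # first file: the FASTA database
            if (substr($0, 1, 1) == ">") {
                id = substr($0, 2)       # drop the leading ">"
                sub(/[ \t].*/, "", id)   # keep only the first word as the ID
            } else {
                gsub(/[ \t\r]/, "")      # ignore stray whitespace
                len[id] += length($0)    # sequences may span several lines
            }
            next
        }
        { print $0, len[$1], len[$2] }   # second file: the similarities
    ' "$1" "$2"
}
```

An ID that is missing from the FASTA file yields an empty length column, which makes such lines easy to spot before the file is placed in <b>NewSims</b>.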
692    <p>
693    The above description will produce similarities using a single invocation of
694    blastall.  For most large genomes, and whenever you wish to process a batch of genomes,
695    you should use parallel processing while maintaining the spirit of the approach.
696    No matter how you produce the new similarities, they need to be added
697    as a file in the <b>FIGdisk/FIG/Data/NewSims</b> directory.  Then, you
698    need to index these similarities using
699    <pre>
700            index_sims ~/FIGdisk/FIG/Data/NewSims/XXXX
701    </pre>
702    where XXXX is the file you added.  If you have more than one such
703    file, just put in several arguments for the command.  This will
704    "index" the similarities in that any of the new PEGs which have
705    similarities connecting them to other PEGs from the existing genomes
706    can now be displayed.  However, the connection from the existing
707    genomes to the new PEGs does not yet exist (we call these the "flips"
708    of the computed sims).  To get this ability, you need to go through a
709    process that will make your system unavailable for a period (and, it
710    will produce a substantial load on your system for a day or so, while
711    the SEED sorts, sifts, inserts, and generally plays with the "flips").
712  <br>  <br>
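To make the notion of a "flip" concrete, here is a minimal sketch of reversing one similarity so that it reads from the other PEG's point of view. It assumes the standard 12-column <b>blastall -m 8</b> layout (query id, subject id, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, e-value, bit score), followed by the two appended length columns. The function name <i>flip_sims</i> is hypothetical, and this shows only the idea, not what <b>update_sims</b> does internally.

```shell
# flip_sims: read tab-separated similarities on stdin and write the "flipped"
# version of each one on stdout. Hypothetical helper, not part of the SEED.
flip_sims() {
    awk -F'\t' -v OFS='\t' '{
        t = $1; $1 = $2;  $2 = t        # swap the two PEG ids
        t = $7; $7 = $9;  $9 = t        # swap the begin coordinates
        t = $8; $8 = $10; $10 = t       # swap the end coordinates
        if (NF >= 14) {                 # swap the appended lengths, if present
            t = $13; $13 = $14; $14 = t
        }
        print
    }'
}
```

The identity, alignment length, e-value, and bit score describe the pair as a whole, so they are left untouched.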
713  To establish the flexibility needed to support all of these alternatives, we implemented the following  The extra steps you need to take to make a fully functional version
714  approach:  are as follows:
715  <ul>  <ol>
716  <li>  <li>
717  The user can describe one or more <b>similarity computational environments</b>  First, you need to run
718  in a configuration file called <i>similarities.config</i>.  The details of this encoding  <pre>
719  are beyond the scope of this document.          update_sims ~/FIGdisk/FIG/Data/Global/peg.synonyms 300 ~/FIGdisk/FIG/Data/Sims ~/Tmp/FlippedSims ~/FIGdisk/FIG/Data/NewSims/*
720  These environments all represent potential ways to compute similarities.  </pre>
721  <br>  This should produce updated similarity files in a VERY BIG directory
722  <li>  that we happened to put at <i>~/Tmp/FlippedSims</i> (but, which you could
723  When a SEED systems administrator (usually, the normal SEED user) wishes to run similarities,  put anywhere).  This may run as much as a day or so (and you can watch
724  he runs a program specifying a specific similarity computational environment.  This causes all  its progress as it updates the similarity files).
725  the queued similarity requests to be batched up and sent off to the specified server (which may simply  <li>The next step is to replace the existing similarity files with the
726  be on the same machine).  He would use the <b>generate_similarities</b> command specifying two parameters: the  newly computed ones.  You need to make the SEED unavailable (via the
727  first specifies a similarities computational environment, and the second specifies whether or not automated assignments  <b>SEED Control Panel</b>).
728  should be computed as the similarity computations complete and the results are installed.  <li>Then, blow away the existing similarities using something like
729  As the similarities complete, they will automatically be installed.  Further, if a set of similarities arrive  <pre>
730  for a given protein-encoding gene, and if there is no current assignment of function for the gene,          rm ~/FIGdisk/FIG/Data/Sims/*
731  an automated assignment may be computed.  Whether or not such automated assignments are computed is determined          rm ~/FIGdisk/FIG/Data/NewSims/*
732  by the second parameter in the command used by the systems administrator to initiate the request.  For example,          cp ~/Tmp/FlippedSims/* ~/FIGdisk/FIG/Data/Sims
733  <pre>          rm -r ~/Tmp/FlippedSims
734          generate_similarities local auto-assignments  </pre>
735  </pre>  There are several ways to do this.  You might want to save the old
736  specifies a similarity computational environment labeled <i>local</i>, which presumably means "run the blast  similarities somewhere.  You might be able to move (rather than copy)
737  requests on this machine", and requests automated assignments for all protein-encoding genes that currently either  the similarities.  Whatever suits you.
738  have no assigned function or have an assigned function that is "hypothetical".  <li> Then run
739  </ul>  <pre>
740            index_sims
741    </pre>
742    to re-index all of the similarities, and you should be fully
743    operational.
744    </ol>
745  <br>  <br>
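The replacement step above removes the old similarities before copying in the new ones. A more cautious variant, sketched here, moves the old directory aside and moves the flipped directory into its place, so the old files survive until you decide to delete them. The function name <i>replace_sims</i> and the backup location are assumptions; the SEED should still be made unavailable beforehand and <b>index_sims</b> run afterwards.

```shell
# replace_sims SIMS_DIR FLIPPED_DIR BACKUP_DIR
# Hypothetical helper: keep the old similarities in BACKUP_DIR and install
# the contents of FLIPPED_DIR as the new SIMS_DIR.
replace_sims() {
    sims_dir=$1
    flipped_dir=$2
    backup_dir=$3

    # Refuse to run unless both source directories exist.
    [ -d "$sims_dir" ]    || { echo "no such directory: $sims_dir" >&2; return 1; }
    [ -d "$flipped_dir" ] || { echo "no such directory: $flipped_dir" >&2; return 1; }

    # Refuse to overwrite an earlier backup.
    [ ! -e "$backup_dir" ] || { echo "backup already exists: $backup_dir" >&2; return 1; }

    mv "$sims_dir" "$backup_dir"  || return 1   # save the old similarities
    mv "$flipped_dir" "$sims_dir" || return 1   # install the new ones
}

# Example (paths as used in this section):
#   replace_sims ~/FIGdisk/FIG/Data/Sims ~/Tmp/FlippedSims ~/Tmp/Sims.old
```

Moving rather than copying is fast when both directories are on the same disk; across disks, <b>mv</b> falls back to copying, so allow for the extra time and space.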
746    
 We anticipate that at least one major center (Argonne National Lab) and, perhaps, more will create well-defined  
 interfaces for handling high-volume requests.  At FIG, we will maintain a set of instructions on how to set up  
 your configuration to exploit these resources.  
   
747  <h2 id="deleting_genomes">Deleting Genomes from a Version of the SEED</h2>  <h2 id="deleting_genomes">Deleting Genomes from a Version of the SEED</h2>
748    
749  There are two common instances in which one wishes to delete genomes from a running version of the SEED: one is  There are two common instances in which one wishes to delete genomes from a running version of the SEED: one is
# Line 734  Line 772 
772  the path to the current Data directory.  The third argument specifies the name of a directory  the path to the current Data directory.  The third argument specifies the name of a directory
773  that is created holding the extraction.  Thus,  that is created holding the extraction.  Thus,
774  <pre>  <pre>
775          extract_genomes unrestricted /Users/fig/FIGdisk/FIG/Data /Volumes/Tmp/ExtractedData          extract_genomes unrestricted ~/FIGdisk/FIG/Data /Volumes/Tmp/ExtractedData
776  </pre>  </pre>
777  would create the extracted Data directory for you.  If you wish to then produce a fully distributable  would create the extracted Data directory for you.  If you wish to then produce a fully distributable
778  version of the SEED from the existing version and the extracted Data directory, you would  version of the SEED from the existing version and the extracted Data directory, you would
779  use  use
780  <pre>  <pre>
781          make_a_SEED /Users/fig/FIGdisk /Volumes/Tmp/ExtractedData /Volumes/MyFriend/FIGdisk.ReadyToGo          make_a_SEED ~/FIGdisk /Volumes/Tmp/ExtractedData /Volumes/MyFriend/FIGdisk.ReadyToGo
782          rm -rf /Volumes/Tmp/ExtractedData          rm -rf /Volumes/Tmp/ExtractedData
783  </pre>  </pre>
784    
# Line 777  Line 815 
815          compute_pins_and_clusters 562.4          compute_pins_and_clusters 562.4
816  </pre>  </pre>
817  would compute and add entries for all of the <i>pegs</i> in genome 562.4.  would compute and add entries for all of the <i>pegs</i> in genome 562.4.
818    
819    <h2 id="auto_annotation">
820       Automatic Annotation of Genomes
821    </h2>
822    The SEED provides a simple but limited capability for automated assignment
823    of protein-encoding gene function based on similarity.
824    Candidate functions are assigned scores based on the combined strengths
825    of all BLASTP similarities to genes carrying that particular assignment,
826    weighted by the provenance and assignment-confidence for each similar gene.
827    The final automated function assignment is then determined from the
828    list of candidate functions and their associated scores.
829    
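The scoring idea can be illustrated with a small sketch. It is not the algorithm used by <b>auto_assign</b>: the input format (function name, similarity strength, weight, separated by tabs) and the function name <i>best_function</i> are assumptions made for the example, with the weight standing in for provenance and assignment confidence.

```shell
# best_function: read "function<TAB>strength<TAB>weight" lines on stdin and
# print the function with the highest weighted sum, followed by that sum.
# Hypothetical helper for illustration only.
best_function() {
    awk -F'\t' '
        { score[$1] += $2 * $3; seen = 1 }   # accumulate weighted strength
        END {
            if (!seen) exit                  # no candidates, no output
            first = 1
            for (f in score)
                if (first || score[f] > score[best]) { best = f; first = 0 }
            printf "%s\t%.1f\n", best, score[best]
        }'
}
```

In this sketch several moderately strong, well-supported hits can outvote a single stronger hit, which is the main difference from simply taking the top BLASTP hit.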
830    Automated assignment is a four-step process:
831    <ol>
832    <li> Create a list of PEGs to be automatically assigned.
833    If one wishes to make assignments to an entire organism or set of organisms
834    that are already installed in the SEED, the simplest method for creating
835    this list is to type the following command:
836    <pre>
837        pegs Genome1 Genome2 Genome3 ... > ~/Tmp/peg.list
838    </pre>
839    
840    <p>
841    <li> Next, create a list of candidate function-assignments using the following
842    command:
843    <pre>
844       auto_assign < ~/Tmp/peg.list > ~/Tmp/candidate.funcs
845    </pre>
846    (NOTE: The `auto_assign` command has some additional optional parameters;
847    for example, if one knows that all the PEGs in 'peg.list' are from
848    prokaryotic organisms, one can make use of this additional information
849    by invoking `auto_assign` as follows:
850    <pre>
851       auto_assign prokaryote < ~/Tmp/peg.list > ~/Tmp/candidate.funcs
852    </pre>
853    Also, if one wishes to use an alternate file of similarity data named 'simfile'
854    instead of the precomputed similarities stored in the SEED, one can instead type:
855    <pre>
856       auto_assign sims=simfile < ~/Tmp/peg.list > ~/Tmp/candidate.funcs
857    </pre>
858    Finally, `auto_assign` can read a set of alternate parameters from a file,
859    but we recommend that you stick with the default settings, and not exploit this
860    last feature unless you are a qualified SEED wizard.)
861    <p>
862    
863    <li> Next, create a SEED format assigned-functions file as follows:
864    <pre>
865        make_calls < ~/Tmp/candidate.funcs > ~/Tmp/assigned_functions
866    </pre>
867    Alternately, if you wish to suppress the class of "non-informative" function assignments
868    such as "Hypothetical protein," "Unclassified protein," "predicted gene," etc.,
869    you may do so using the '-no_hypos' flag:
870    <pre>
871        make_calls -no_hypos < ~/Tmp/candidate.funcs > ~/Tmp/assigned_functions
872    </pre>
873    
874    <li> Finally, install the automated assignments in the SEED using the command
875    <pre>
876        fig auto_assignF ~/Tmp/assigned_functions
877    </pre>
878    
879    </ol>
880    
881    It should be once again noted that the SEED's automated assignment algorithm
882    is quite simple and crude, being only slightly better than simply assigning
883    the function of the highest-scoring BLASTP hit; however, it at least provides
884    a "quick and dirty" starting point for making an initial assessment of a genome,
885    which may then be cleaned up and refined by skilled genome annotators.
886    
887    
888    
889    
890    
891    
