
Diff of /FigTutorial/SEED_administration_issues.html


revision 1.12, Thu Aug 5 23:26:06 2004 UTC revision 1.16, Tue Jan 25 23:01:46 2005 UTC
# Line 39  Line 39 
39      Computing "Pins" and "Clusters"      Computing "Pins" and "Clusters"
40  </A>  </A>
41    
42    <li><A HREF="#auto_annotation">
43        Automatic Annotation of Genomes
44    </A>
45    
46  </ul>  </ul>
47    
48    
# Line 657  Line 661 
661    
662  <h2 id="sims">Computing Similarities</h2>  <h2 id="sims">Computing Similarities</h2>
663    
664  Adding a genome does not automatically get similarities computed for the new genome; it queues the request.  Adding a genome does not automatically get similarities computed for the new genome.
665  To get the similarities actually computed, you need to establish a computational environment on which  To get the similarities actually computed, you need to compute them and make them available in
666  the blast runs will be made, and then initiate a request on the machine running the SEED.  the <b>FIGdisk/FIG/Data/NewSims</b> directory.
667  <p>  <p>
668  This is not a completely trivial process because there are a variety of different ways to compute  To compute similarities, you will need to do the following:
 similarities:  
669  <ol>  <ol>
670  <li> You can just compute them on the system running the SEED.  This can take several days, but this  <li>The translations of the set of PEGs in your new genome (e.g., genome 562.4) should be in
671  is often a perfectly reasonable way to get the job done.  <b>~fig/FIGdisk/FIG/Data/Organisms/562.4/Features/peg/fasta</b>.  A copy of this was appended to
672  <li>Alternatively, you may be in an environment where you have a set of networked machines (say, 4-5 machines),  <b>~fig/FIGdisk/FIG/Data/Global/nr</b> when your genome was added.  <b>nr</b> is the "nonredundant database"
673  and you wish to just exploit these machines to do the blast runs.  we use to compute similarities (and the one you must use).  To get the initial blast results, you would use something
674  <li> Finally, you may be dealing with a large genome or genomes (and, hence, the need for many days of computation).  like
 In this case, it makes sense to utilize a large computational resource, and this resource may either  
 be a local cluster or a service provided over the net.  
 </ol>  
675  <br>  <br>
676  To establish the flexibility needed to support all of these alternatives, we implemented the following  <pre>
677  approach:            blastall -i ~fig/FIGdisk/FIG/Data/Organisms/562.4/Features/peg/fasta -d ~fig/FIGdisk/FIG/Data/Global/nr -m 8 -FF -p blastp | reduce_sims ~fig/FIGdisk/FIG/Data/Global/peg.synonyms 300 > reduced.sims
678  <ul>  </pre>
679    <br>
680    which produces the blast results in a tab-separated format.  The invocation of <b>reduce_sims</b> is optional.
681    It has the effect of limiting the retained similarities for each PEG to 300, using a truncation approach that attempts to preserve at least one similarity against each other genome (i.e., the trimming is selective).
682  <li>  <li>
683  The user can describe one or more <b>similarity computational environments</b>  The output of blastall lacks two columns that we need: the lengths of the two similar sequences.  To add them, you would use
 in a configuration file called <i>similarities.config</i>.  The details of this encoding  
 are beyond the scope of this document.  
 These environments all represent potential ways to compute similarities.  
 <br>  
 <li>  
 When a SEED systems administrator (usually, the normal SEED user) wishes to run similarities,  
 he runs a program specifying a specific similarity computational environment.  This causes all  
 the queued similarity requests to be batched up and sent off to the specified server (which may simply  
 be on the same machine).  He would use the <b>generate_similarities</b> command specifying two parameters: the  
 first specifies a similarities computational environment, and the second specifies whether or not automated assignments  
 should be computed as the similarity computations complete and the results are installed.  
 As the similarities complete, they will automatically be installed.  Further, if a set of similarities arrive  
 for a given protein-encoding gene, and if there is no current assignment of function for the gene,  
 an automated assignment may be computed.  Whether or not such automated assignments are computed is determined  
 by the second parameter in the command used by the systems administrator to initiate the request.  For example,  
 <pre>  
         generate_similarities local auto-assignments  
 </pre>  
 specifies a similarity computational environment labeled <i>local</i>, which presumably means "run the blast  
 requests on this machine", and requests automated assignments for all protein-encoding genes that currently either  
 have no assigned function or have an assigned function that is "hypothetical".  
 </ul>  
684  <br>  <br>
685    <pre>
686  We anticipate that at least one major center (Argonne National Lab) and, perhaps, more will create well-defined          reformat_sims ~fig/FIGdisk/FIG/Data/Global/nr < reduced.sims > ~fig/FIGdisk/FIG/Data/NewSims/sims.for.562.4
687  interfaces for handling high-volume requests.  At FIG, we will maintain a set of instructions on how to set up  </pre>
688  your configuration to exploit these resources.  <br>
689    This appends the two length columns to each similarity and places the results in the <b>NewSims</b>
690    directory, where they belong.
691    </ol>
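The two post-processing steps above (selective trimming, then appending sequence lengths) can be illustrated with a minimal Python sketch. This is not the SEED implementation of <b>reduce_sims</b> or <b>reformat_sims</b>: the function names, the assumption that PEG ids look like "fig|GENOME.peg.N", and the use of an in-memory dictionary of lengths are all hypothetical choices made for illustration. Hits are assumed to be lists of the tab-separated BLAST fields (query id first, subject id second), ordered best-first.

```python
def genome_of(seq_id):
    """Return the genome part of an id shaped like 'fig|GENOME.peg.N' (assumed
    convention), or None for ids that do not follow that shape."""
    if seq_id.startswith("fig|") and ".peg." in seq_id:
        return seq_id[4:].split(".peg.")[0]
    return None


def trim_hits(hits, max_hits):
    """Trim one query's hits (best first) to at most max_hits, reserving the
    best hit against each genome before filling the remaining slots."""
    if len(hits) <= max_hits:
        return list(hits)
    keep = set()
    seen_genomes = set()
    # First pass: keep the best hit against each genome not yet represented.
    for i, hit in enumerate(hits):
        genome = genome_of(hit[1])
        if genome is not None and genome not in seen_genomes and len(keep) < max_hits:
            seen_genomes.add(genome)
            keep.add(i)
    # Second pass: fill any remaining slots with the best remaining hits.
    for i in range(len(hits)):
        if len(keep) >= max_hits:
            break
        keep.add(i)
    return [hits[i] for i in sorted(keep)]


def append_lengths(hits, lengths):
    """Return each hit with the query length and subject length appended.
    'lengths' maps sequence id -> length; a missing id raises KeyError."""
    return [list(hit) + [str(lengths[hit[0]]), str(lengths[hit[1]])] for hit in hits]
```

With a limit of 2 and three hits, two of them against the same genome, the sketch keeps the best hit overall plus the single hit against the second genome, rather than simply the top two.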
692  <p>  <p>
693    The above description will produce similarities using a single invocation of
694    blastall.  For most large genomes, and whenever you wish to process a batch of genomes,
695    you should use parallel processing while maintaining the spirit of the approach.
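One way to parallelize while keeping the same approach is to split the genome's PEG translations into several FASTA files, run the blastall pipeline on each file separately, and concatenate the outputs. The sketch below shows only the splitting step; the function name and the round-robin distribution are assumptions for illustration, not part of the SEED tools. Because every record goes to exactly one chunk, all hits for a given PEG come from the same blast run.

```python
def split_fasta(lines, n_chunks):
    """Distribute FASTA records round-robin into n_chunks lists of lines.
    Lines that appear before the first '>' header are ignored."""
    chunks = [[] for _ in range(n_chunks)]
    record_index = -1
    for line in lines:
        if line.startswith(">"):
            record_index += 1          # a header line starts a new record
        if record_index >= 0:
            chunks[record_index % n_chunks].append(line)
    return chunks


if __name__ == "__main__":
    example = [">peg1\n", "MKT\n", ">peg2\n", "MLV\n", ">peg3\n", "MAA\n"]
    for number, chunk in enumerate(split_fasta(example, 2)):
        print(number, "".join(chunk).replace("\n", " "))
```

Each chunk would then be written to its own file and used as the <b>-i</b> argument of a separate blastall run.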
696  No matter how you produce the new similarities, they need to be added  No matter how you produce the new similarities, they need to be added
697  as a file in the <b>FIGdisk/FIG/Data/NewSims</b> directory.  Then, you  as a file in the <b>FIGdisk/FIG/Data/NewSims</b> directory.  Then, you
698  need to index these similarities using  need to index these similarities using
# Line 732  Line 719 
719          update_sims ~/FIGdisk/FIG/Data/Global/peg.synonyms 300 ~/FIGdisk/FIG/Data/Sims ~/Tmp/FlippedSims ~/FIGdisk/FIG/Data/NewSims/*          update_sims ~/FIGdisk/FIG/Data/Global/peg.synonyms 300 ~/FIGdisk/FIG/Data/Sims ~/Tmp/FlippedSims ~/FIGdisk/FIG/Data/NewSims/*
720  </pre>  </pre>
721  This should produce updated similarity files in a VERY BIG directory  This should produce updated similarity files in a VERY BIG directory
722  that we happened to put at <i>~/Tmp/Flipped</i> (but, which you could  that we happened to put at <i>~/Tmp/FlippedSims</i> (but, which you could
723  put anywhere).  This may run as much as a day or so (and you can watch  put anywhere).  This may run as much as a day or so (and you can watch
724  its progress as it updates the similarity files).  its progress as it updates the similarity files).
725  <li>The next step is to replace the existing similarity files with the  <li>The next step is to replace the existing similarity files with the
# Line 828  Line 815 
815          compute_pins_and_clusters 562.4          compute_pins_and_clusters 562.4
816  </pre>  </pre>
817  would compute and add entries for all of the <i>pegs</i> in genome 562.4.  would compute and add entries for all of the <i>pegs</i> in genome 562.4.
818    
819    <h2 id="auto_annotation">
820       Automatic Annotation of Genomes
821    </h2>
822    The SEED provides a simple but limited capability for automated assignment
823    of protein-encoding gene function based on similarity.
824    Candidate functions are assigned scores based on the combined strengths
825    of all BLASTP similarities to genes carrying that particular assignment,
826    weighted by the provenance and assignment-confidence for each similar gene.
827    The final automated function assignment is then determined from the
828    list of candidate functions and their associated scores.
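The scoring idea described above can be sketched roughly as follows. This is a simplified illustration, not the algorithm used by <b>auto_assign</b>: it assumes that the "strength" of a similarity is its BLASTP bit score, that provenance and assignment confidence have already been folded into a single numeric weight per similar gene, and that the winning function is simply the one with the highest total. All names and weights are hypothetical.

```python
from collections import defaultdict


def score_candidates(hits):
    """Sum weighted similarity strengths per candidate function.
    'hits' is an iterable of (function, bit_score, weight) tuples."""
    scores = defaultdict(float)
    for function, bit_score, weight in hits:
        scores[function] += bit_score * weight
    return dict(scores)


def best_function(scores):
    """Return the candidate function with the highest score, or None."""
    if not scores:
        return None
    return max(scores, key=scores.get)


if __name__ == "__main__":
    hits = [
        ("DNA polymerase I", 200, 1.0),
        ("Hypothetical protein", 250, 0.5),   # strong hit, low-confidence source
        ("DNA polymerase I", 100, 1.0),
    ]
    scores = score_candidates(hits)
    print(best_function(scores), scores)
```

In this example the two moderately strong, well-supported hits outweigh a single stronger hit from a low-confidence source, which is the behavior that distinguishes this approach from taking the top BLASTP hit alone.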
829    
830    Automated assignment is a four-step process:
831    <ol>
832    <li> Create a list of PEGs to be automatically assigned.
833    If one wishes to make assignments to an entire organism or set of organisms
834    that are already installed in the SEED, the simplest method for creating
835    this list is to type the following command:
836    <pre>
837        pegs Genome1 Genome2 Genome3 ... > ~/Tmp/peg.list
838    </pre>
839    
840    <p>
841    <li> Next, create a list of candidate function-assignments using the following
842    command:
843    <pre>
844       auto_assign < ~/Tmp/peg.list > ~/Tmp/candidate.funcs
845    </pre>
846    (NOTE: The `auto_assign` command has some additional optional parameters;
847    for example, if one knows that all the PEGs in 'peg.list' are from
848    prokaryotic organisms, one can make use of this additional information
849    by invoking `auto_assign` as follows:
850    <pre>
851       auto_assign prokaryote < ~/Tmp/peg.list > ~/Tmp/candidate.funcs
852    </pre>
853    Also, if one wishes to use an alternate file of similarity data named 'simfile'
854    instead of the precomputed similarities stored in the SEED, one can instead type:
855    <pre>
856       auto_assign sims=simfile < ~/Tmp/peg.list > ~/Tmp/candidate.funcs
857    </pre>
858    Finally, `auto_assign` can read a set of alternate parameters from a file,
859    but we recommend that you stick with the default settings, and not exploit this
860    last feature unless you are a qualified SEED wizard.)
861    <p>
862    
863    <li> Next, create a SEED format assigned-functions file as follows:
864    <pre>
865        make_calls < ~/Tmp/candidate.funcs > ~/Tmp/assigned_functions
866    </pre>
867    Alternatively, if you wish to suppress the class of "non-informative" function assignments
868    such as "Hypothetical protein," "Unclassified protein," "predicted gene," etc.,
869    you may do so using the '-no_hypos' flag:
870    <pre>
871        make_calls -no_hypos < ~/Tmp/candidate.funcs > ~/Tmp/assigned_functions
872    </pre>
873    
874    <li> Finally, install the automated assignments in the SEED using the command
875    <pre>
876        fig auto_assignF ~/Tmp/assigned_functions
877    </pre>
878    
879    </ol>
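To make the effect of suppressing "non-informative" assignments concrete, here is a small sketch of that kind of filter. It is not the code behind <b>make_calls -no_hypos</b>: it assumes a two-column, tab-separated input (PEG id, then function), and the list of non-informative phrases is limited to the three examples named above. The real file format and the real list may differ.

```python
import re

# Hypothetical list, taken only from the examples given in the text.
NON_INFORMATIVE = re.compile(
    r"^(hypothetical protein|unclassified protein|predicted gene)\.?$",
    re.IGNORECASE,
)


def drop_non_informative(lines):
    """Yield (peg, function) pairs, skipping non-informative functions.
    Each input line is assumed to be 'PEG<TAB>function'."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue                      # skip malformed lines
        function = fields[1].strip()
        if NON_INFORMATIVE.match(function):
            continue
        yield fields[0], function


if __name__ == "__main__":
    sample = [
        "fig|562.4.peg.1\tDNA polymerase I\n",
        "fig|562.4.peg.2\tHypothetical protein\n",
    ]
    for peg, function in drop_non_informative(sample):
        print(peg, function, sep="\t")
```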
880    
881    It should be once again noted that the SEED's automated assignment algorithm
882    is quite simple and crude, being only slightly better than simply assigning
883    the function of the highest-scoring BLASTP hit; however, it at least provides
884    a "quick and dirty" starting point for making an initial assessment of a genome,
885    which may then be cleaned up and refined by skilled genome annotators.
886    
887    
888    
889    
890    
891    

Legend:
Removed from v.1.12  
changed lines
  Added in v.1.16
