
Diff of /FigTutorial/SEED_administration_issues.html


revision 1.4, Wed Jul 21 16:59:51 2004 UTC revision 1.16, Tue Jan 25 23:01:46 2005 UTC
# Line 1  Line 1 
1  <h1>SEED Administration</h1>  <h1>SEED Administration</h1>
2  <p>This tutorial discusses a number of issues that you will need to know about  
3    in order to install, share, and maintain your SEED installation.</p>  <p>
4  <h2>Backing Up Your Data</h2>  This tutorial discusses a number of issues that you will need to know about
5    in order to install, share, and maintain your SEED installation.
6    It is organized as follows:
7    </p>
8    
9    <ul>
10    <li><A HREF="#backups">
11         Backing Up Your Data
12    </A>
13    
14    <li><A HREF="#copying">
15         Copying a Version of the SEED
16    </A>
17    
18    <li><A HREF="#multiple_copies">
19         Running Multiple Copies of the SEED
20    </A>
21    
22    <li><A HREF="#adding_genomes">
23    Adding a New Genome to an Existing SEED
24    </A>
25    
26    <li><A HREF="#sims">
27        Computing Similarities
28    </A>
29    
30    <li><A HREF="#deleting_genomes">
31        Deleting Genomes from a Version of the SEED
32    </A>
33    
34    <li><A HREF="#reintegrate_sims">
35        Periodic Reintegration of Similarities
36    </A>
37    
38    <li><A HREF="#pins_and_clusters">
39        Computing "Pins" and "Clusters"
40    </A>
41    
42    <li><A HREF="#auto_annotation">
43        Automatic Annotation of Genomes
44    </A>
45    
46    </ul>
47    
48    
49    <h2 id="backups">Backing Up Your Data</h2>
50  The data and code stored within the SEED are organized as follows:  The data and code stored within the SEED are organized as follows:
51  <pre>  <pre>
52          ~fig                                 on a Mac: /Users/fig; on Linux: /home/fig          ~fig                                 on a Mac: /Users/fig; on Linux: /home/fig
# Line 49  Line 94 
94  /Volumes/Backup is a backup disk.  Then,  /Volumes/Backup is a backup disk.  Then,
95  <br>  <br>
96  <pre>  <pre>
97          cp -pRP /Users/fig/FIGdisk/FIG/Data /Volumes/Backup/Data.Backup          cp -pRP ~/FIGdisk/FIG/Data /Volumes/Backup/Data.Backup
98          gzip -r /Volumes/Backup/Data.Backup          gzip -r /Volumes/Backup/Data.Backup
99  </pre>  </pre>
100  <br>  <br>
101  would be a reasonable way to make a backup.  The copy preserves  would be a reasonable way to make a backup.  The copy preserves
102  permissions, copies recursively, and does not follow symbolic links.  permissions, copies recursively, and does not follow symbolic links.
103  <br>  <br>
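The two backup commands above can be wrapped in a small shell function. This is a minimal sketch, assuming a POSIX shell; the name `backup_seed_data` and the refusal to overwrite an existing backup are illustrative choices, not part of the SEED distribution.

```shell
# Hypothetical helper (not part of the SEED): copy a Data directory to a
# backup location and compress the copy, mirroring the two commands above.
backup_seed_data() {
    src=$1      # e.g. ~/FIGdisk/FIG/Data
    dest=$2     # e.g. /Volumes/Backup/Data.Backup

    # Refuse to overwrite an earlier backup.
    if [ -e "$dest" ]; then
        echo "backup target already exists: $dest" >&2
        return 1
    fi

    # -p preserves permissions and times, -R recurses,
    # -P copies symbolic links as links instead of following them.
    cp -pRP "$src" "$dest" || return 1

    # Compress every regular file in the copy in place.
    gzip -r "$dest"
}
```

A call such as `backup_seed_data ~/FIGdisk/FIG/Data /Volumes/Backup/Data.Backup` reproduces the example above. Note that gzip skips symbolic links and may report a warning status when it meets one.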
104  <h2>Copying a Version of the SEED</h2>  <h2 id="copying">Copying a Version of the SEED</h2>
105    
106  To make a second copy of the SEED (either for a friend or for yourself), you should use tar  To make a second copy of the SEED (either for a friend or for yourself), you should use tar
107  to preserve a few symbolic links (which are relative, not absolute; this means that they can  to preserve a few symbolic links (which are relative, not absolute; this means that they can
# Line 156  Line 201 
201  <blockquote>  <blockquote>
202    <p><a href="http://www-unix.mcs.anl.gov/SEEDWiki/moin.cgi/SeedInstallationInstructions">      http://www-unix.mcs.anl.gov/SEEDWiki/moin.cgi/SeedInstallationInstructions</a></p>    <p><a href="http://www-unix.mcs.anl.gov/SEEDWiki/moin.cgi/SeedInstallationInstructions">      http://www-unix.mcs.anl.gov/SEEDWiki/moin.cgi/SeedInstallationInstructions</a></p>
203  </blockquote>  </blockquote>
204  <h2>Running Multiple Copies of the SEED</h2>  <h2 id="multiple_copies">Running Multiple Copies of the SEED</h2>
205    
206  For individual users that use the SEED to support comparative analysis, a single copy is completely  For individual users that use the SEED to support comparative analysis, a single copy is completely
207  adequate.  Adding genomes can usually be done without disrupting normal use, and a very occasional major  adequate.  Adding genomes can usually be done without disrupting normal use, and a very occasional major
# Line 166  Line 211 
211  effort.  In this case, you have a user community that is sensitive to disruptions of service, and you  effort.  In this case, you have a user community that is sensitive to disruptions of service, and you
212  have frequent demands to update versions of data.  In this case, it is best to have two systems: the  have frequent demands to update versions of data.  In this case, it is best to have two systems: the
213  <b>production system</b> is used to support the larger user community, and the <b>update system</b> is  <b>production system</b> is used to support the larger user community, and the <b>update system</b> is
214  used to prepare updated versions of the system.  Even so, work stoppages of 4-8 hours will occur when  used to prepare updated versions of the system.
215  new releases are swapped in.  To swap in new data from the update system to the production system,  New genomes are added to the update system, and then periodically a
216  you need to  revised Data directory is extracted to update the production system.
217    Even so, work stoppages of a few hours will occur when
218    new releases are swapped in.
219    <p>
220    This use of an "update" and a "production" system is quite analogous
221    to running a production system which is occasionally updated from new
222    Data DVDs (which FIG normally makes available about every 4-6 months).
223    That is, in both cases you are updating a production system from a
224    newly created <b>Data</b> directory that is lacking assignments and
225    annotations that exist on your production system.  However, if you have
226    added new genomes to the production system (that are not part of the
227    releases you may acquire via DVDs), you should get the new release,
228    install the versions of your local genomes, and then do this update
229    procedure.
230    <p>
231    The plan we propose is to build a completely encapsulated new version
232    of the system, then capture updates from the old production system, update
233    the new production system, and then make the new version the actual
234    production system.  This last step amounts to altering a symbolic link
235    to point at the new production system rather than the old.  This has
236    the virtue of ease of recovery -- that is, if something goes wrong you
237    can flip back to the old system.
238    The actual steps are as follows:
239  <ol>  <ol>
240  <li>stop all work on the production machine,  
241    <li> First, make sure that you are in the BASH shell by typing "echo $SHELL";
242       if the result is not "bash", type "bash" to enter the BASH shell.
243    <p>
244    
245    <li> Next, check that the result of typing "which perl" is the version
246       of perl owned by the SEED; it should look something like
247       <pre>
248           /Users/fig/FIGdisk/env/mac/bin/perl
249       </pre>
250       although the exact results will depend on where your existing copy
251       of the SEED is installed, whether your platform is a Macintosh or LINUX,
252       etc. If the result does not look similar to the above, type:
253       <pre>
254           source Path_to_FIGdisk/config/fig-user-env.sh
255       </pre>
256       to set up your FIG environment properly.
257    <p>
258    
259    <li> Next, make a copy of the Code Distribution Environment (from a DVD
260    or via the network).  Suppose that we have made such a directory in
261    CodeDistEnv.  Then use,
262    <pre>
263            cd CodeDistEnv
264            ./install-code TargetDirectory
265    </pre>
266    where <b>TargetDirectory</b> is where you wish to build the new
267    production version.  We recommend calling it something like
268    <b>FIGdisk.July24</b>.
269    <p>
270    
271    <li> Stop all work on the production machine for the duration of the update.
272         You do this by clicking on the "Seed Control Panel" link,
273         and then entering an explanatory message in the text box
274         and clicking on the "Disable SEED server" button.
275    <p>
276    
277  <li>You now need to capture the assignments, annotations and  <li>You now need to capture the assignments, annotations and
278  subsystems work that has been done on the production machine.  To do       subsystems work that has been done on the production machine.
279  this, you need to know when the last production release was       To do this, you need to know when the last production release
280  installed.  Suppose that it was July 1, 2004.  If that was the date,       was installed.  Suppose that it was July 1, 2004.
281  we recommend that you       If that was the date, we recommend that you run
 run<br><br>  
282  <pre>  <pre>
283      <b>extract_data_for_syncing_after_update 7/1/2004 /tmp/sync.data.july.1.2004<</b>          <b>extract_data_for_syncing_after_update 7/1/2004 /tmp/sync.data.july.1.2004</b>
284  </pre>  </pre>
285  <br><br>  
286  This will capture your updates and save them in the directory  This will capture your updates and save them in the directory
287  /tmp/sync.data.july.1.2004.       /tmp/sync.data.july.1.2004.<br>
288  <li>Now, you need to replace your <b>Data</b> directory (within  <p>
289  <b>FIGdisk/FIG</b>) with the new version from the update system.  We  
290  suggest that you do the following:  <li>Now, you need to stop the existing production system using
291  <ol>  <pre>
292  <li>archive the existing <b>Data</b> directory.  These can usually be          ~/FIGdisk/bin/stop-servers
293  discarded within a month or two, but keeping them around is a good  </pre>
294  safety measure.  <p>
295  <li>move a copy of the update <b>Data</b> directory into the  
296  <b>FIGdisk/FIG</b> directory.  <li>Now, you need to configure the runtime environment for the system
297  </ol>  you are running on.
298  At this point, you have a version of the data from the update system  To do this, use
299  in the right location, but the internal databases all contain the old data.  <pre>
300  <li> Now, run          cd TargetDirectory
301            ./configure MacOrLinux
302    </pre>
303    where <b>MacOrLinux</b> must be a currently supported environment.
304    Those that are supported on July 24, 2004 are <b>mac</b> for
305    Macintoshes running panther, <b>mac-jaguar</b> for those that have not
306    upgraded to panther, and <b>linux-postgres</b>.
307    <p>
308    
309    <li>Now, you need to insert the new Data directory into the newly
310    constructed version of the SEED.  To do this use
311    <pre>
312            chmod -R 777 TheNewData
313            cd TargetDirectory/FIG
314            ln -s TheNewData Data
315    </pre>
316    where TheNewData is the new Data directory, which normally comes  from the
317    update system.  If you acquired a new Data directory via Data DVDs, you
318    will need to unpack them using the README instructions, but what
319    results is a new version of the <b>Data</b> directory.
320    <p>
321    
322    <li>Now, you need to start the servers in order to load the databases
323    with the new release using
324    <pre>
325            cd TargetDirectory/bin
326            ./start-servers
327            cd ..
328            source config/fig-user-env.sh
329            init_FIG
330            fig load_all
331    </pre>
332    This last command will run for several hours.
333    <p>
334    
335    (<b>WARNING:</b> Please note that, because the new SEED's databases
336    do not yet exist, the `init_FIG` command will generate two totally
337    harmless but rather terrifying error messages the very first time it is executed,
338    so that its output will look something like this:
339    
340  <pre>  <pre>
341          <b>fig load_all</b>  DBI connect('dbname=fig;port=10000','fig',...) failed: FATAL:  Database "fig" does not exist in the system catalog. at /home2/FIGdisk.July22/dist/releases/snap-2004-0723/linux-postgres/lib/FigKernelPackages/DBrtns.pm line 21
342    
343    Initializing new SEED database fig
344    
345    ERROR:  DROP DATABASE: database "fig" does not exist
346    dropdb: database removal failed
347    CREATE DATABASE
348    NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index 'file_table_pkey' for table 'file_table'
349    CREATE TABLE
350    
351    Complete. You will need to run "fig load_all" to load the data.
352  </pre>  </pre>
353  to reload the production databases with the data from the newly inserted Data directory.  We recognize that generating the above two faux "FATAL" errors
354  This will usually take several hours.  constitutes a rather ugly and inelegant implementation,
355    but we have not yet found a more elegant database initialization method
356    that can avoid generating them.)
357    <p>
358    
359  <li>Now, you need to capture the changes made to the old production  <li>Now, you need to capture the changes made to the old production
360  version using something like  version using something like
 <br>  
361  <pre>  <pre>
362          <b>sync_new_system /tmp/sync.data.july.1.2004 make-assignments</b>          <b>sync_new_system /tmp/sync.data.july.1.2004 make-assignments</b>
363  </pre>  </pre>
364  <br>  <p>
365  <li> make the production machine available for use.  
366    <li>Run
367        <pre>
368            index_annotations
369            index_subsystems
370            make_indexes
371        </pre>
372    <p>
373    
374    <li> Now, finally, you should alter the symbolic link in <i>~fig</i> to
375    the current FIGdisk using something like:
376    <pre>
377            cd ~fig
378            rm FIGdisk     # should be removing a symbolic link to the current SEED
379            ln -s TargetDirectory FIGdisk
380    </pre>
381    That should make the new SEED the one available through the Web interface.
382    <p>
383    
384  <li>You should now bring your update system to the same state as the  <li>You should now bring your update system to the same state as the
385  production system.  This can be done by making sure that  production system.  This can be done by making sure that
386  <b>/tmp/sync.data.july.1.2004</b> is accessible to the update system.  <b>/tmp/sync.data.july.1.2004</b> is accessible to the update system.
# Line 222  Line 394 
394  <br>  <br>
395  on the update machine.  on the update machine.
396  </ol>  </ol>
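The final link switch in the procedure above, and the "flip back" recovery it makes possible, can be sketched as a pair of shell functions. The names `switch_release` and `rollback_release` and the `.previous` bookkeeping file are assumptions made for illustration; they are not SEED commands. The sketch also assumes `readlink` is available, as it is on Linux and Mac OS X.

```shell
# Hypothetical helper (not part of the SEED): repoint a symbolic link at a
# new release and remember the previous target so the change can be undone.
switch_release() {
    link=$1        # e.g. ~fig/FIGdisk
    new_target=$2  # e.g. the TargetDirectory built above

    # Only ever replace a symbolic link, never a real directory.
    if [ -e "$link" ] && [ ! -L "$link" ]; then
        echo "$link exists and is not a symbolic link" >&2
        return 1
    fi
    if [ ! -d "$new_target" ]; then
        echo "no such directory: $new_target" >&2
        return 1
    fi

    # Record the old target (if any) next to the link for rollback.
    if [ -L "$link" ]; then
        readlink "$link" > "$link.previous"
        rm "$link"
    fi
    ln -s "$new_target" "$link"
}

# Flip back to the target recorded by the last switch_release call.
rollback_release() {
    link=$1
    [ -f "$link.previous" ] || return 1
    old_target=$(cat "$link.previous")
    rm "$link" && ln -s "$old_target" "$link"
}
```

Checking that the existing path really is a symbolic link guards against the one dangerous mistake in this step, namely removing a real FIGdisk directory.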
397    <p>
398    
399  Our experience is that anytime a group wishes to share a common production environment,  Our experience is that anytime a group wishes to share a common production environment,
400  this 2-system approach is the way to do it.  You can, if necessary,  this 2-system approach is the way to do it.  You can, if necessary,
401  put both systems on the same physical machine.  This does require some  put both systems on the same physical machine.  This does require some
# Line 233  Line 407 
407  desirable to spend a little more and get at least 1 gigabyte of main  desirable to spend a little more and get at least 1 gigabyte of main
408  memory and 200 gigabytes of external disk.  memory and 200 gigabytes of external disk.
409  <br>  <br>
410  <h2>Adding a New Genome to an Existing SEED</h2>  <h2 id="adding_genomes">Adding a New Genome to an Existing SEED</h2>
411  To add a new genome to a running SEED is fairly easy, but there are a  To add a new genome to a running SEED is fairly easy, but there are a
412  number of details that do have to be handled with care.  number of details that do have to be handled with care.
413  <p>  <p>
# Line 485  Line 659 
659  The <i>add_genome</i> request will add your new genome and queue a computational request that similarities  The <i>add_genome</i> request will add your new genome and queue a computational request that similarities
660  be computed for the protein-encoding genes.  be computed for the protein-encoding genes.
661    
662  <h2>Computing Similarities</h2>  <h2 id="sims">Computing Similarities</h2>
663    
664  Adding a genome does not automatically get similarities computed for the new genome; it queues the request.  Adding a genome does not automatically get similarities computed for the new genome.
665  To get the similarities actually computed, you need to establish a computational environment on which  To get the similarities actually computed, you need to compute them and make them available in
666  the blast runs will be made, and then initiate a request on the machine running the SEED.  the <b>FIGdisk/FIG/Data/NewSims</b> directory.
667  <p>  <p>
668  This is not a completely trivial process because there are a variety of different ways to compute  To compute similarities, you will need to do the following:
 similarities:  
669  <ol>  <ol>
670  <li> You can just compute them on the system running the SEED.  This can take several days, but this  <li>The translations of the set of PEGs in your new genome (i.e., genome 562.4) should be in
671  is often a perfectly reasonable way to get the job done.  <b>~fig/FIGdisk/FIG/Data/Organisms/562.4/Features/peg/fasta</b>.  A copy of this was appended to
672  <li>Alternatively, you may be in an environment where you have a set of networked machines (say, 4-5 machines),  <b>~fig/FIGdisk/FIG/Data/Global/nr</b> when your genome was added.  <b>nr</b> is the "nonredundant database"
673  and you wish to just exploit these machines to do the blast runs.  we use to compute similarities (and the one you must use).  To get the initial blast results, you would use something
674  <li> Finally, you may be dealing with a large genome or genomes (and, hence, the need for many days of computation).  like
675  In this case, it makes sense to utilize a large computational resource, and this resource may either  <br>
676  be a local cluster or a service provided over the net.  <pre>
677              blastall -i ~fig/FIGdisk/FIG/Data/Organisms/562.4/Features/peg/fasta -d ~fig/FIGdisk/FIG/Data/Global/nr -m 8 -FF -p blastp | reduce_sims ~fig/FIGdisk/FIG/Data/Global/peg.synonyms 300 > reduced.sims
678    </pre>
679    <br>
680    which produces the blast results in a tab-separated format.  The invocation of <b>reduce_sims</b> is optional.
681    It has the effect of limiting the retained similarities for each PEG to 300, with a truncation approach that attempts to preserve at least one similarity against each other genome (i.e., the trimming is selective).
682    <li>
683    The output of blastall lacks 2 columns that we need -- columns containing the length of each of the similar sequences.  To add that, you would use
684    <br>
685    <pre>
686            reformat_sims ~fig/FIGdisk/FIG/Data/Global/nr < reduced.sims > ~fig/FIGdisk/FIG/Data/NewSims/sims.for.562.4
687    </pre>
688    <br>
689    This will actually append two columns to each similarity and place the results in the <b>NewSims</b>
690    directory where it should be.
691  </ol>  </ol>
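To make the idea of the two extra length columns concrete, here is a small awk-based sketch. It is not the SEED's `reformat_sims`; the function name `append_lengths` and the column order (query length, then subject length) are assumptions made for illustration only.

```shell
# Illustrative only: this is not the SEED's reformat_sims. It appends the
# lengths of the query (column 1) and subject (column 2) sequences to each
# line of tab-separated blast output (-m 8), reading lengths from a FASTA file.
append_lengths() {
    fasta=$1   # FASTA file containing every sequence named in the sims
    awk -F '\t' -v OFS='\t' '
        # First file: accumulate sequence length per identifier.
        NR == FNR {
            if (substr($0, 1, 1) == ">") {
                split(substr($0, 2), hdr, /[ \t]/)
                id = hdr[1]
            } else if (id != "") {
                len[id] += length($0)
            }
            next
        }
        # Second input (stdin): emit the line plus the two lengths.
        { print $0, len[$1], len[$2] }
    ' "$fasta" -
}
```

It reads similarities on standard input, for example `append_lengths nr < reduced.sims`. An identifier missing from the FASTA file produces an empty length field, which is an easy way to spot a mismatch between the similarities and the database they were computed against.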
692    <p>
693    The above description will produce similarities using a single invocation of
694    blastall.  For most large genomes, and whenever you wish to process a batch of genomes,
695    you should use parallel processing while maintaining the spirit of the approach.
696    No matter how you produce the new similarities, they need to be added
697    as a file in the <b>FIGdisk/FIG/Data/NewSims</b> directory.  Then, you
698    need to index these similarities using
699    <pre>
700            index_sims ~/FIGdisk/FIG/Data/NewSims/XXXX
701    </pre>
702    where XXXX is the file you added.  If you have more than one such
703    file, just put in several arguments for the command.  This will
704    "index" the similarities in that any of the new PEGs which have
705    similarities connecting them to other PEGs from the existing genomes
706    can now be displayed.  However, the connection from the existing
707    genomes to the new PEGs does not yet exist (we call these the "flips"
708    of the computed sims).  To get this ability, you need to go through a
709    process that will make your system unavailable for a period (and, it
710    will produce a substantial load on your system for a day or so, while
711    the SEED sorts, sifts, inserts, and generally plays with the "flips").
712  <br>  <br>
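The "flips" described above can be pictured as the same similarity written from the other sequence's point of view. The sketch below assumes that a flip simply exchanges the two identifiers and their coordinates in blast -m 8 column order; this is an assumption for illustration, and the function `flip_sims` is not the SEED's `update_sims`, which also merges and trims the results.

```shell
# Illustrative only: not the SEED's update_sims. Given tab-separated
# similarities in blast -m 8 column order, optionally followed by two
# length columns, emit each similarity as seen from the other sequence.
flip_sims() {
    awk -F '\t' -v OFS='\t' '
        function swap(i, j,   t) { t = $i; $i = $j; $j = t }
        {
            swap(1, 2)     # query id    <-> subject id
            swap(7, 9)     # query start <-> subject start
            swap(8, 10)    # query end   <-> subject end
            if (NF >= 14) swap(13, 14)   # appended lengths, if present
            print
        }
    '
}
```

Seen this way, it is clear why the flips touch the similarity files of the existing genomes: every new similarity adds a line to the record of a PEG that was already in the system.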
713  To establish the flexibility needed to support all of these alternatives, we implemented the following  The extra steps you need to take to make a fully functional version
714  approach:  are as follows:
715  <ul>  <ol>
716  <li>  <li>
717  The user can describe one or more <b>similarity computational environments</b>  First, you need to run
718  in a configuration file called <i>similarities.config</i>.  The details of this encoding  <pre>
719  are beyond the scope of this document.          update_sims ~/FIGdisk/FIG/Data/Global/peg.synonyms 300 ~/FIGdisk/FIG/Data/Sims ~/Tmp/FlippedSims ~/FIGdisk/FIG/Data/NewSims/*
720  These environments all represent potential ways to compute similarities.  </pre>
721  <br>  This should produce updated similarity files in a VERY BIG directory
722  <li>  that we happened to put at <i>~/Tmp/FlippedSims</i> (but, which you could
723  When a SEED systems administrator (usually, the normal SEED user) wishes to run similarities,  put anywhere).  This may run as much as a day or so (and you can watch
724  he runs a program specifying a specific similarity computational environment.  This causes all  its progress as it updates the similarity files).
725  the queued similarity requests to be batched up and sent off to the specified server (which may simply  <li>The next step is to replace the existing similarity files with the
726  be on the same machine).  He would use the <b>generate_similarities</b> command specifying two parameters: the  newly computed ones.  You need to make the SEED unavailable (via the
727  first specifies a similarities computational environment, and the second specifies whether or not automated assignments  <b>SEED Control Panel</b>).
728  should be computed as the similarity computations complete and the results are installed.  <li>Then, blow away the existing similarities using something like
729  As the similarities complete, they will automatically be installed.  Further, if a set of similarities arrive  <pre>
730  for a given protein-encoding gene, and if there is no current assignment of function for the gene,          rm ~/FIGdisk/FIG/Data/Sims/*
731  an automated assignment may be computed.  Whether or not such automated assignments are computed is determined          rm ~/FIGdisk/FIG/Data/NewSims/*
732  by the second parameter in the command used by the systems administrator to initiate the request.  For example,          cp ~/Tmp/FlippedSims/* ~/FIGdisk/FIG/Data/Sims
733  <pre>          rm -r ~/Tmp/FlippedSims
734          generate_similarities local auto-assignments  </pre>
735  </pre>  There are several ways to do this.  You might want to save the old
736  specifies a similarity computational environment labeled <i>local</i>, which presumably means "run the blast  similarities somewhere.  You might be able to move (rather than copy)
737  requests on this machine", and requests automated assignments for all protein-encoding genes that currently either  the similarities.  Whatever suits you.
738  have no assigned function or have an assigned function that is "hypothetical".  <li> Then run
739  </ul>  <pre>
740            index_sims
741    </pre>
742    to re-index all of the similarities, and you should be fully
743    operational.
744    </ol>
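For the replacement step above, one cautious variant keeps the old similarity files and moves the new ones into place rather than copying them. This is a sketch under stated assumptions: the function name `install_new_sims` and the archive directory argument are hypothetical, and the SEED should already be disabled when it runs.

```shell
# Hypothetical helper (not part of the SEED): replace the contents of the
# Sims directory with freshly flipped similarities, keeping the old files
# in an archive directory instead of deleting them outright.
install_new_sims() {
    sims_dir=$1     # e.g. ~/FIGdisk/FIG/Data/Sims
    flipped_dir=$2  # e.g. ~/Tmp/FlippedSims
    archive_dir=$3  # where the old similarity files are kept

    # Do nothing unless there is something to install.
    if [ -z "$(ls -A "$flipped_dir" 2>/dev/null)" ]; then
        echo "no new similarity files in $flipped_dir" >&2
        return 1
    fi

    mkdir -p "$archive_dir" || return 1
    # Move the old files aside; an empty Sims directory is fine.
    for f in "$sims_dir"/*; do
        if [ -e "$f" ]; then
            mv "$f" "$archive_dir"/ || return 1
        fi
    done
    # Move (rather than copy) the new files into place.
    for f in "$flipped_dir"/*; do
        mv "$f" "$sims_dir"/ || return 1
    done
}
```

The files in NewSims still have to be removed and `index_sims` rerun afterwards, exactly as in the steps above; the sketch only covers the swap of the Sims directory contents.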
745  <br>  <br>
746    
747  We anticipate that at least one major center (Argonne National Lab) and, perhaps, more will create well-defined  <h2 id="deleting_genomes">Deleting Genomes from a Version of the SEED</h2>
 interfaces for handling high-volume requests.  At FIG, we will maintain a set of instructions on how to set up  
 your configuration to exploit these resources.  
   
 <h2>Deleting Genomes from a Version of the SEED </h2>  
748    
749  There are two common instances in which one wishes to delete genomes from a running version of the SEED: one is  There are two common instances in which one wishes to delete genomes from a running version of the SEED: one is
750  when you wish to replace an existing version of a genome (in which case the replacement is viewed as first  when you wish to replace an existing version of a genome (in which case the replacement is viewed as first
# Line 564  Line 772 
772  the path to the current Data directory.  The third argument specifies the name of a directory  the path to the current Data directory.  The third argument specifies the name of a directory
773  that is created holding the extraction.  Thus,  that is created holding the extraction.  Thus,
774  <pre>  <pre>
775          extract_genomes unrestricted /Users/fig/FIGdisk/FIG/Data /Volumes/Tmp/ExtractedData          extract_genomes unrestricted ~/FIGdisk/FIG/Data /Volumes/Tmp/ExtractedData
776  </pre>  </pre>
777  would create the extracted Data directory for you.  If you wish to then produce a fully distributable  would create the extracted Data directory for you.  If you wish to then produce a fully distributable
778  version of the SEED from the existing version and the extracted Data directory, you would  version of the SEED from the existing version and the extracted Data directory, you would
779  use  use
780  <pre>  <pre>
781          make_a_SEED /Users/fig/FIGdisk /Volumes/Tmp/ExtractedData /Volumes/MyFriend/FIGdisk.ReadyToGo          make_a_SEED ~/FIGdisk /Volumes/Tmp/ExtractedData /Volumes/MyFriend/FIGdisk.ReadyToGo
782          rm -rf /Volumes/Tmp/ExtractedData          rm -rf /Volumes/Tmp/ExtractedData
783  </pre>  </pre>
784    
785  <h2>Periodic Reintegration of Similarities</h2>   <h2 id="reintegrate_sims">Periodic Reintegration of Similarities</h2>
786    
787  When the initial SEED was constructed, similarities were computed.  For most similarities of the form  When the initial SEED was constructed, similarities were computed.  For most similarities of the form
788  "Id1 and Id2 are similar", entries were "recorded" for both Id1 and Id2.  This is not always true,  "Id1 and Id2 are similar", entries were "recorded" for both Id1 and Id2.  This is not always true,
# Line 592  Line 800 
800  </pre>  </pre>
801  The job will probably run for quite a while (perhaps as much as a day or two).  The job will probably run for quite a while (perhaps as much as a day or two).
802    
803  <h2>Computing "Pins" and "Clusters"</h2>  <h2 id="pins_and_clusters">Computing "Pins" and "Clusters"</h2>
804    
805  The SEED displays potentially significant clusters on prokaryotic chromosomes.  In the  The SEED displays potentially significant clusters on prokaryotic chromosomes.  In the
806  process of finding preserved contiguity, it computes "pins", which are simply a set of genes  process of finding preserved contiguity, it computes "pins", which are simply a set of genes
# Line 607  Line 815 
815          compute_pins_and_clusters 562.4          compute_pins_and_clusters 562.4
816  </pre>  </pre>
817  would compute and add entries for all of the <i>pegs</i> in genome 562.4.  would compute and add entries for all of the <i>pegs</i> in genome 562.4.
818    
819    <h2 id="auto_annotation">
820       Automatic Annotation of Genomes
821    </h2>
822    The SEED provides a simple but limited capability for automated assignment
823    of protein-encoding gene function based on similarity.
824    Candidate functions are assigned scores based on the combined strengths
825    of all BLASTP similarities to genes carrying that particular assignment,
826    weighted by the provenance and assignment-confidence for each similar gene.
827    The final automated function assignment is then determined from the
828    list of candidate functions and their associated scores.
829    
830    Automated assignment is a four-step process:
831    <ol>
832    <li> Create a list of PEGs to be automatically assigned.
833    If one wishes to make assignments to an entire organism or set of organisms
834    that are already installed in the SEED, the simplest method for creating
835    this list is to type the following command:
836    <pre>
837        pegs Genome1 Genome2 Genome3 ... > ~/Tmp/peg.list
838    </pre>
839    
840    <p>
841    <li> Next, create a list of candidate function-assignments using the following
842    command:
843    <pre>
844       auto_assign < ~/Tmp/peg.list > ~/Tmp/candidate.funcs
845    </pre>
846    (NOTE: The `auto_assign` command has some additional optional parameters;
847    for example, if one knows that all the PEGs in 'peg.list' are from
848    prokaryotic organisms, one can make use of this additional information
849    by invoking `auto_assign` as follows:
850    <pre>
851       auto_assign prokaryote < ~/Tmp/peg.list > ~/Tmp/candidate.funcs
852    </pre>
853    Also, if one wishes to use an alternate file of similarity data named 'simfile'
854    instead of the precomputed similarities stored in the SEED, one can instead type:
855    <pre>
856       auto_assign sims=simfile < ~/Tmp/peg.list > ~/Tmp/candidate.funcs
857    </pre>
858    Finally, `auto_assign` can read a set of alternate parameters from a file,
859    but we recommend that you stick with the default settings, and not exploit this
860    last feature unless you are a qualified SEED wizard.)
861    <p>
862    
863    <li> Next, create a SEED format assigned-functions file as follows:
864    <pre>
865        make_calls < ~/Tmp/candidate.funcs > ~/Tmp/assigned_functions
866    </pre>
867    Alternatively, if you wish to suppress the class of "non-informative" function assignments
868    such as "Hypothetical protein," "Unclassified protein," "predicted gene," etc.,
869    you may do so using the '-no_hypos' flag:
870    <pre>
871        make_calls -no_hypos < ~/Tmp/candidate.funcs > ~/Tmp/assigned_functions
872    </pre>
873    
874    <li> Finally, install the automated assignments in the SEED using the command
875    <pre>
876        fig auto_assignF ~/Tmp/assigned_functions
877    </pre>
878    
879    </ol>
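The four steps above can be chained in one shell function so that nothing is installed if an earlier step fails. This is a minimal sketch: the function name `auto_annotate` and the choice of a single work directory are illustrative, while `pegs`, `auto_assign`, `make_calls` and `fig auto_assignF` are invoked exactly as in the steps above and must already be on the PATH.

```shell
# A sketch that chains the four commands above for one or more genomes.
# The function name and work-directory layout are illustrative only.
auto_annotate() {
    if [ $# -lt 2 ]; then
        echo "usage: auto_annotate WORKDIR GENOME..." >&2
        return 1
    fi
    work=$1     # scratch directory, e.g. ~/Tmp
    shift       # the remaining arguments are genome identifiers

    pegs "$@"                             > "$work/peg.list"           || return 1
    auto_assign < "$work/peg.list"        > "$work/candidate.funcs"    || return 1
    make_calls  < "$work/candidate.funcs" > "$work/assigned_functions" || return 1

    # Stop before installing anything if no assignments were produced.
    if [ ! -s "$work/assigned_functions" ]; then
        echo "no assignments produced" >&2
        return 1
    fi
    fig auto_assignF "$work/assigned_functions"
}
```

For example, `auto_annotate ~/Tmp 562.4` would run the default pipeline for genome 562.4. Options such as `prokaryote` or `-no_hypos` are deliberately left out here and would need to be added to the corresponding line.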
880    
881    It should be once again noted that the SEED's automated assignment algorithm
882    is quite simple and crude, being only slightly better than simply assigning
883    the function of the highest-scoring BLASTP hit; however, it at least provides
884    a "quick and dirty" starting point for making an initial assessment of a genome,
885    which may then be cleaned up and refined by skilled genome annotators.
886    
887    
888    
889    
890    
891    

Legend:
Removed from v.1.4  
changed lines
  Added in v.1.16
