<!-- Revision 1.18, Wed Jul 19 17:36:16 2006 UTC, by overbeek (branch MAIN): minor fix to SEED admin document -->

<h1>SEED Administration</h1>

<p>
This tutorial discusses a number of issues that you will need to know about
in order to install, share, and maintain your SEED installation. 
It is organized as follows: 
</p>

<ul>
<li><A HREF="#backups">
     Backing Up Your Data
</A>

<li><A HREF="#copying">
     Copying a Version of the SEED
</A>

<li><A HREF="#multiple_copies">
     Running Multiple Copies of the SEED
</A>

<li><A HREF="#adding_genomes">
Adding a New Genome to an Existing SEED
</A>

<li><A HREF="#importing_external">
Importing External Protein Data
</A>

<li><A HREF="#sims">
    Computing Similarities
</A>

<li><A HREF="#deleting_genomes">
    Deleting Genomes from a Version of the SEED
</A>

<li><A HREF="#reintegrate_sims">
    Periodic Reintegration of Similarities
</A>

<li><A HREF="#pins_and_clusters">
    Computing "Pins" and "Clusters"
</A>

<li><A HREF="#auto_annotation">
    Automatic Annotation of Genomes
</A>

</ul>


<h2 id="backups">Backing Up Your Data</h2>
The data and code stored within the SEED are organized as follows:
<pre>
	~fig				     on a Mac: /Users/fig; on Linux: /home/fig
		FIGdisk
			dist                 source code
			FIG
				Tmp          temporary files
				Data         data in readable form
</pre>
<ol><li>
The directory <b>FIGdisk</b> holds both the code and data for the
SEED.  The data is loaded into a database system that stores the data
in a location external to FIGdisk, but otherwise a running SEED is
encapsulated within FIGdisk.  A symbolic link to FIGdisk is maintained 
in the directory ~fig.
<br>
<li>
Within FIGdisk there are two key directories:
<br>
<br><ol><li>
<b>dist</b> contains the source code, and

<li>
<b>FIG</b> contains the execution environment and Data.
</ol>
<br>
<li>
Within FIG, there are a number of directories.  The most important are
<br>
<br>
<ol>
<li>
<b>Data</b>, which contains all of the data in a human-readable form,
and
<br>
<br>
<li>
<b>Tmp</b>, which contains the temporary files built by SEED in
response to commands.
</ol>
</ol>
<br>
Hence, to back up your data, you should simply copy the Data
directory to a separate disk.  Suppose that
/Volumes/Backup is a backup disk.  Then,
<br>
<pre>
	cp -pRP ~/FIGdisk/FIG/Data /Volumes/Backup/Data.Backup
	gzip -r /Volumes/Backup/Data.Backup
</pre>
<br>
would be a reasonable way to make a backup.  The copy preserves
permissions, copies recursively, and does not follow symbolic links;
<tt>gzip -r</tt> then compresses each file in the backup individually.
<br>
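<p>
The same backup can be scripted.  The following is a minimal Python sketch, not
part of the SEED distribution; the function name <tt>backup_data</tt> and the
date-stamped directory name are assumptions made for illustration.  It copies
the Data directory without following symbolic links, as <tt>cp -pRP</tt> does.
</p>

```python
import shutil
import time
from pathlib import Path


def backup_data(data_dir, backup_root):
    """Copy data_dir into a new, date-stamped directory under backup_root.

    Symbolic links are copied as links (not followed), mirroring `cp -pRP`.
    Returns the path of the backup that was created.
    """
    stamp = time.strftime("%Y-%m-%d")
    # Hypothetical naming scheme: Data.Backup.YYYY-MM-DD
    target = Path(backup_root) / f"Data.Backup.{stamp}"
    # symlinks=True keeps links as links; copytree copies files with copy2,
    # which preserves permission bits and timestamps where the platform allows.
    # copytree refuses to overwrite an existing target directory.
    shutil.copytree(data_dir, target, symlinks=True)
    return target
```

<p>
A call such as <tt>backup_data("/Users/fig/FIGdisk/FIG/Data", "/Volumes/Backup")</tt>
would then create one backup per day; compression is left as a separate step.
</p>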
<h2 id="copying">Copying a Version of the SEED</h2>

To make a second copy of the SEED (either for a friend or for yourself), you should use tar
to preserve a few symbolic links (which are relative, not absolute; this means that they can
be copied while still preserving the integrity of the whole system).
So, suppose that you have a FIGdisk in /Volumes/From/FIGdisk.Jan8 and you wish to copy it
to /Volumes/To.  Use 
<pre>
   cd /Volumes/From
   tar cf - FIGdisk.Jan8 | (cd /Volumes/To; tar xf -)
</pre>
<p>This should produce the desired copy.  In this case, suppose that we are in a
  Mac OS X
  environment, and <b>From</b> and <b>To</b> are FireWire disks.  To install the system on a friend's
  Mac, you would unmount <b>To</b>, plug it into the new machine, and then set the symbolic link to the active
  FIGdisk using
  <br>
</p>
<table border="1" bgcolor="#CCCCCC">
  <tr>
    <td width="403"><font face="Courier New, Courier, mono">cd ~fig</font></td>
    <td width="285">&nbsp;</td>
  </tr>
  <tr>
    <td><font face="Courier New, Courier, mono">rm FIGdisk</font></td>
    <td># fails if there is no existing FIGdisk on the machine</td>
  </tr>
  <tr>
    <td><font face="Courier New, Courier, mono">ln -s /Volumes/To/FIGdisk.Jan8 FIGdisk</font></td>
    <td>&nbsp;</td>
  </tr>
  <tr>
    <td><font face="Courier New, Courier, mono">bash</font></td>
    <td>Switch to using the bash shell</td>
  </tr>
  <tr>
    <td><font face="Courier New, Courier, mono">cd FIGdisk</font></td>
    <td>&nbsp;</td>
  </tr>
  <tr>
    <td height="23"><font face="Courier New, Courier, mono">cp CURRENT_RELEASE DEFAULT_RELEASE</font></td>
    <td># Causes the new configuration to use the code that was running in the
      original installation</td>
  </tr>
  <tr>
    <td height="23"><font face="Courier New, Courier, mono">./configure <em>arch-name</em></font></td>
    <td># Configure the new SEED disk for architecture <em>arch-name</em>. </td>
  </tr>
  <tr>
    <td height="23"><font face="Courier New, Courier, mono"> source config/fig-user-env.sh <br>
    </font></td>
    <td># Set up the environment for using the SEED</td>
  </tr>
  <tr>
    <td height="23"><font face="Courier New, Courier, mono">start-servers <br>
    </font></td>
    <td># Start the database server and registration servers</td>
  </tr>
  <tr>
    <td height="23"><font face="Courier New, Courier, mono">init_FIG <br>
    </font></td>
    <td># Initialize a new relational database</td>
  </tr>
  <tr>
    <td height="23"><font face="Courier New, Courier, mono">fig load_all</font></td>
    <td># Load the database from the SEED data files. This may take several hours</td>
  </tr>
</table>
<p>At this point, the new SEED copy should be ready to use. You only need to
  perform the configure, init_FIG, and fig load_all steps once after installing
  a new copy of the SEED. After a reboot or other clean start of the computer,
  you will only have to do these steps:</p>
<table border="1" bgcolor="#EEEEEE">
  <tr>
    <td width="403"><font face="Courier New, Courier, mono">cd ~fig/FIGdisk</font></td>
    <td width="285">&nbsp;</td>
  </tr>
  <tr>
    <td><font face="Courier New, Courier, mono">bash</font></td>
    <td>Switch to using the bash shell</td>
  </tr>
  <tr>
    <td height="23"><font face="Courier New, Courier, mono"> source config/fig-user-env.sh <br>
    </font></td>
    <td># Set up the environment for using the SEED</td>
  </tr>
  <tr>
    <td height="23"><font face="Courier New, Courier, mono">start-servers <br>
    </font></td>
    <td># Start the database server and registration servers</td>
  </tr>
</table>
<p>Upon setting up a new computer for running SEED, you should read the full
  documentation for SEED installation, as it has a number of platform-specific
  modifications that need to be performed. This document can currently be found
  at the following
location in the SEED Wiki:  </p>
<blockquote>
  <p><a href="http://www-unix.mcs.anl.gov/SEEDWiki/moin.cgi/SeedInstallationInstructions">	http://www-unix.mcs.anl.gov/SEEDWiki/moin.cgi/SeedInstallationInstructions</a></p>
</blockquote>
<h2 id="multiple_copies">Running Multiple Copies of the SEED</h2>

For individual users that use the SEED to support comparative analysis, a single copy is completely
adequate.  Adding genomes can usually be done without disrupting normal use, and a very occasional major 
reorganization that runs over the weekend is not a big deal.  
<p>
The situation is somewhat different when the system is being used to support a major sequencing/annotation
effort.  There you have a user community that is sensitive to disruptions of service, and you
have frequent demands to update versions of data.  It is then best to have two systems: the 
<b>production system</b> is used to support the larger user community, and the <b>update system</b> is
used to prepare updated versions of the system.  
New genomes are added to the update system, and then periodically a
revised Data directory is extracted to update the production system.
Even so, work stoppages of a few hours will occur when 
new releases are swapped in.
<p>
This use of an "update" and a "production" system is quite analogous
to running a production system which is occasionally updated from new
Data DVDs (which FIG normally makes available about every 4-6 months).
That is, in both cases you are updating a production system from a
newly created <b>Data</b> directory that is lacking assignments and
annotations that exist on your production system.  However, if you have
added new genomes to the production system (that are not part of the
releases you may acquire via DVDs), you should get the new release,
install the versions of your local genomes, and then do this update
procedure. 
<p>
The plan we propose is to build a completely encapsulated new version
of the system, then capture updates from the old production system, update
the new production system, and then make the new version the actual
production system.  This last step amounts to altering a symbolic link
to point at the new production system rather than the old.  This has
the virtue of ease of recovery -- that is, if something goes wrong you
can flip back to the old system.
The actual steps are as follows:
<ol>

<li> First, make sure that you are in the BASH shell by typing "echo $SHELL"; 
   if the result does not end in "bash" (for example, "/bin/bash"), type "bash" to enter the BASH shell.
<p>

<li> Next, check that the result of typing "which perl" is the version
   of perl owned by the SEED; it should look something like
   <pre>
       /Users/fig/FIGdisk/env/mac/bin/perl
   </pre>
   although the exact results will depend on where your existing copy 
   of the SEED is installed, whether your platform is a Macintosh or Linux,
   etc. If the result does not look similar to the above, type:
   <pre>
       source Path_to_FIGdisk/config/fig-user-env.sh
   </pre>
   to set up your FIG environment properly.
<p>

<li> Next, make a copy of the Code Distribution Environment (from a DVD
or via the network).  Suppose that we have made such a directory in
CodeDistEnv.  Then use,
<pre>
	cd CodeDistEnv
	./install-code TargetDirectory
</pre>
where <b>TargetDirectory</b> is where you wish to build the new
production version.  We recommend calling it something like
<b>FIGdisk.July24</b>.
<p>

<li> Stop all work on the production machine for the duration of the update. 
     You do this by clicking on the "Seed Control Panel" link,
     and then entering an explanatory message in the text box 
     and clicking on the "Disable SEED server" button.
<p>

<li> You now need to capture the assignments, annotations and
     subsystems work that has been done on the production machine.  
     To do this, you need to know when the last production release 
     was installed.  Suppose that it was July 1, 2004.  
     If that was the date, we recommend that you run
     <pre>
        <b>extract_data_for_syncing_after_update 7/1/2004 /tmp/sync.data.july.1.2004</b>
     </pre>

     This will capture your updates and save them in the directory
     /tmp/sync.data.july.1.2004.<br>
<p>

<li>Now, you need to stop the existing production system using
<pre>
	~/FIGdisk/bin/stop-servers
</pre>
<p>

<li>Now, you need to configure the runtime environment for the system
you are running on.  
To do this, use
<pre>
	cd TargetDirectory
	./configure MacOrLinux
</pre>
where <b>MacOrLinux</b> must be a currently supported environment.
Those that are supported as of July 24, 2004 are <b>mac</b> for
Macintoshes running Panther, <b>mac-jaguar</b> for those that have not
upgraded to Panther, and <b>linux-postgres</b>.
<p>

<li>Now, you need to insert the new Data directory into the newly
constructed version of the SEED.  To do this use
<pre>
	chmod -R 777 TheNewData
	cd TargetDirectory/FIG
	ln -s TheNewData Data
</pre>
where TheNewData is the new Data directory, which normally comes  from the
update system.  If you acquired a new Data directory via Data DVDs, you
will need to unpack them using the README instructions, but what
results is a new version of the <b>Data</b> directory.
<p>

<li>Now, you need to start the servers in order to load the databases
with the new release using
<pre>
	cd TargetDirectory/bin
	./start-servers
	cd ..
	source config/fig-user-env.sh
	init_FIG
	fig load_all
</pre>
This last command will run for several hours.
<p>

(<b>WARNING:</b> Please note that, because the new SEED's databases 
do not yet exist, the `init_FIG` command will generate two totally
harmless but rather terrifying error messages the very first time it is executed, 
so that its output will look something like this:

<pre>
DBI connect('dbname=fig;port=10000','fig',...) failed: FATAL:  Database "fig" does not exist in the system catalog. at /home2/FIGdisk.July22/dist/releases/snap-2004-0723/linux-postgres/lib/FigKernelPackages/DBrtns.pm line 21

Initializing new SEED database fig

ERROR:  DROP DATABASE: database "fig" does not exist
dropdb: database removal failed
CREATE DATABASE
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index 'file_table_pkey' for table 'file_table'
CREATE TABLE

Complete. You will need to run "fig load_all" to load the data.
</pre>
We recognize that generating the above two faux "FATAL" errors 
is a rather ugly and inelegant implementation,
but we have not yet found a more elegant database initialization method 
that avoids generating them.)
<p>

<li> Now, you need to capture the changes made to the old production
     version using something like
     <pre>
         <b>sync_new_system /tmp/sync.data.july.1.2004 make-assignments</b>
     </pre>
<p>

<li>Run
    <pre>
	index_annotations
	index_subsystems
	make_indexes
    </pre>	
<p>

<li> Now, finally, you should alter the symbolic link in <i>~fig</i> to
the current FIGdisk using something like:
<pre>
	cd ~fig
	rm FIGdisk     # should be removing a symbolic link to the current SEED
	ln -s TargetDirectory FIGdisk
</pre>
That should make the new SEED the one available through the Web interface.
<p>

<li> You should now bring your update system to the same state as the
     production system.  This can be done by making sure that
     <b>/tmp/sync.data.july.1.2004</b> is accessible to the update system.
     If the production and update systems are run on the same machine, then
     the directory is already there.  If not, copy it to <b>/tmp</b> on the
     update machine.  Then run
     <br>
     <pre>
	 <b>sync_new_system /tmp/sync.data.july.1.2004 make-assignments</b>
     </pre>
     <br>
     on the update machine.
</ol>
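<p>
The final switch of the <b>FIGdisk</b> link can also be made without a moment in
which the link is missing.  The sketch below is an alternative to the
<tt>rm</tt>/<tt>ln -s</tt> pair shown above, not something the SEED provides;
the function name <tt>repoint_symlink</tt> is hypothetical, and the approach
assumes a POSIX system where renaming one link over another is atomic.
</p>

```python
import os


def repoint_symlink(link_path, new_target):
    """Make link_path a symbolic link to new_target, replacing any old link.

    The new link is created under a temporary name and then renamed over the
    old one, so link_path never disappears during the switch (POSIX only).
    """
    if os.path.lexists(link_path) and not os.path.islink(link_path):
        # Refuse to clobber a real file or directory.
        raise ValueError(f"{link_path} exists and is not a symbolic link")
    tmp_path = f"{link_path}.new.{os.getpid()}"
    os.symlink(new_target, tmp_path)
    os.replace(tmp_path, link_path)  # rename replaces the old link itself
    return os.readlink(link_path)
```

<p>
Flipping back to the old system, if something goes wrong, is the same call with
the old directory as the target.
</p>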
<p>

Our experience is that anytime a group wishes to share a common production environment,
this 2-system approach is the way to do it.  You can, if necessary,
put both systems on the same physical machine.  This does require some
special handling in setting up two different <b>FIGdisk</b>
directories.  We recommend using <b>FIGdisk.production</b> and
<b>FIGdisk.update</b>.  However, in general it makes sense to use two
separate physical machines, for backup if nothing else.  The update
system can usually be run on a $2000 (or less) box, although it is
desirable to spend a little more and get at least 1 gigabyte of main
memory and 200 gigabytes of external disk.
<br>
<h2 id="adding_genomes">Adding a New Genome to an Existing SEED</h2>
To add a new genome to a running SEED is fairly easy, but there are a
number of details that do have to be handled with care.  
<p>
The first thing to note is that the SEED does not include tools to call genes -- you are expected
to provide gene calls.  This may change at some point, but for now you must call your own genes.  A
number of good tools now exist in the public domain, and you will need to find one that seems adequate
for your needs.
<p>
Let us now
cover how to prepare the actual data.  You need to construct a directory (somewhere like ~fig/Tmp)
of the following form:
<br>
<table width="100%">
<tr>
<td><tt>GenomeId</tt></td>
<td></td>
<td></td>
<td></td>
<td>of the form xxxx.y where xxxx is the taxon ID and y is an integer</td>
</tr>

<tr>
<td></td>
<td><tt>PROJECT</tt></td>
<td></td>
<td></td>
<td>a file containing a description of the source of the data</td>
</tr>

<tr>
<td></td>
<td><tt>GENOME</tt></td>
<td></td>
<td></td>
<td>a file containing a single line identifying the genus, species and strain</td>
</tr>

<tr>
<td></td>
<td><tt>TAXONOMY</tt></td>
<td></td>
<td></td>
<td>a file containing a single line containing the NCBI taxonomy</td>
</tr>

<tr>
<td></td>
<td><tt>RESTRICTIONS</tt></td>
<td></td>
<td></td>
<td>a file containing a description of distribution restrictions (optional)</td>
</tr>

<tr>
<td></td>
<td><tt>CONTIGS</tt></td>
<td></td>
<td></td>
<td>contigs in fasta format</td>
</tr>

<tr>
<td></td>
<td><tt>assigned_functions</tt></td>
<td></td>
<td></td>
<td>function assignments for the protein-encoding genes (optional)</td>
</tr>

<tr>
<td></td>
<td><tt>Features</tt></td>
</tr>

<tr>
<td></td>
<td></td>
<td><tt>peg</tt></td>
</tr>

<tr>
<td></td>
<td></td>
<td></td>
<td><tt>tbl</tt></td>
<td>describes locations and aliases for the protein-encoding genes</td>
</tr>

<tr>
<td></td>
<td></td>
<td></td>
<td><tt>fasta</tt></td>
<td>fasta file of translations of the protein-encoding genes</td>
</tr>

<tr>
<td></td>
<td></td>
<td><tt>rna</tt></td>
</tr>

<tr>
<td></td>
<td></td>
<td></td>
<td><tt>tbl</tt></td>
<td>describes locations and aliases for the rna-encoding genes</td>
</tr>

<tr>
<td></td>
<td></td>
<td></td>
<td><tt>fasta</tt></td>
<td>fasta file of the DNA corresponding to the genes</td>
</tr>


</table>

<!--

<pre>
	GenomeID                          of the form xxxx.y where xxxx is the taxon ID and y is an integer

		PROJECT                   a file containg a description of the source of the data

		GENOME			  a file containing a single line identifying the genus, species and strain

		TAXONOMY		  a file containing a single line containing the NCBI taxonomy

		RESTRICTIONS		  a file containing a description of distribution restrictions (optional)

		contigs			  contigs in fasta format

		assigned_functions	  function assignments for the protein-encoding genes (optional)

		Features

			peg
				tbl       descibes locations and aliases for the protein-encoding genes

				fasta     fasta file of translations of the protein-encoding genes

			rna
				tbl       describes locations and aliases for the rna-encoding genes

				fasta     fasta file of the DNA corresponding to the genes
</pre>
-->
<br>
<br>
Let us expand on this very brief description:
<ol>
<li>
The name of the directory must be of the form xxxx.y where xxxx is the
taxon ID, and y is a sequence number.  For example, 562.1 might be
used for <i>E.coli</i>, since 562 is the NCBI taxon ID for
<i>Escherichia coli</i>.  The sequence number (y) is used to
distinguish multiple genomes having the same taxon ID. 
<br><br>
<li>
The assigned_functions file contains assignments of function for the
protein-encoding genes.  Each line is of the form
<pre>
		Id\tFunction\tConfidence  (\t stands for a tab character)
</pre>
The Id must be a valid PEG Id.  These are of the form:
<pre>
		fig|xxxx.y.peg.z
</pre>
where xxxx.y is the genome Id, and z is an integer that uniquely distinguishes
the peg (protein-encoding gene).
<br>
<i>Confidence</i> is a single character code: 
<br>
<ul>
<li>a space for "normal"
<li>w for "weak"
<li>e for experimentally verified
<li>s for "strong evidence (but not experimental)"
</ul>
The second tab and the confidence code can be omitted (it will default to a space).
The assigned_functions file is optional.  You can leave it blank and, after adding the genome
to the SEED, ask for automated assignments.
<br><br>
<li>
The tbl files specify the locations of genes, as well as any aliases.  Each line in a tbl file 
is of the form
<br>
<pre>
	Id\tLocation\tAliases    (the aliases are separated by tabs)
</pre>
The Id must conform to the fig|xxxx.y.peg.z format described above.  The <i>Location</i> is of the form
<br>
<pre>
	L1,L2,L3...Ln

where each Li describes a region on a contig and is of the form

	<i>Contig_Begin_End</i> where

	      Contig is the Id of the contig,
	      Begin is the position of the first character, and
	      End is the position of the last character
</pre>
<ul>
<li>if Begin > End, the region being described is on the complementary strand, and
<li>the End position is the last character preceding the stop codon (i.e., the region
corresponding to a protein-encoding gene is thought of as including all bases from the
first base of the start codon to the last base before the stop codon).
</ul>
For example,
<pre>
fig|562.1.peg.15	Escherichia_coli_K12_14168_15295	dnaJ	b0015	sp|P08622	gi|16128009 
</pre>
describes the <i>dnaJ</i> gene encoded on the positive strand from 14168 through 15295 on the contig Escherichia_coli_K12.
The gene is from the genome 562.1, and it has 4 specified aliases.
<li>
The fasta files must have gene Ids that match tbl file entries.  The <i>peg</i> fasta file contains translations,
while the <i>rna</i> fasta file contains DNA sequences.
<li>
Both the <i>peg</i> and the <i>rna</i> subdirectories are optional.
</ol>
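<p>
The tbl format described above can be checked mechanically before a genome is
added.  The following Python sketch is illustrative only; the names
<tt>Region</tt>, <tt>parse_location</tt> and <tt>parse_tbl_line</tt> are
hypothetical, and accepting <tt>rna</tt> as well as <tt>peg</tt> in the Id is
an assumption based on the rna tbl file having the same layout.
</p>

```python
import re
from typing import NamedTuple

# fig|xxxx.y.peg.z ; "rna" is accepted too (an assumption, see the lead-in).
FEATURE_ID = re.compile(r"^fig\|(\d+\.\d+)\.(peg|rna)\.(\d+)$")


class Region(NamedTuple):
    contig: str
    begin: int
    end: int

    @property
    def strand(self):
        # Begin > End means the region lies on the complementary strand.
        return "-" if self.begin > self.end else "+"


def parse_location(location):
    """Split 'L1,L2,...' into Regions; each Li is Contig_Begin_End."""
    regions = []
    for piece in location.split(","):
        # Contig Ids may themselves contain underscores, so split from the right.
        contig, begin, end = piece.rsplit("_", 2)
        regions.append(Region(contig, int(begin), int(end)))
    return regions


def parse_tbl_line(line):
    """Return (feature_id, genome_id, regions, aliases) for one tbl line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("a tbl line needs at least an Id and a Location")
    feature_id, location = fields[0], fields[1]
    match = FEATURE_ID.match(feature_id)
    if match is None:
        raise ValueError(f"not a valid feature Id: {feature_id!r}")
    # Remaining tab-separated fields are aliases; drop empty ones.
    aliases = [alias.strip() for alias in fields[2:] if alias.strip()]
    return feature_id, match.group(1), parse_location(location), aliases
```

<p>
Applied to the <i>dnaJ</i> line above, this yields genome 562.1, a single region
on contig Escherichia_coli_K12 from 14168 to 15295 on the "+" strand, and four
aliases.
</p>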
<br>
The SEED provides a utility that can be used to produce such a directory from a GenBank entry.  Thus,
<br>
<pre>
	parse_genbank 562.4 ~/Tmp/562.4 < genbank.entry.for.a.new.E.coli.genome
</pre>
would attempt to produce a properly formatted directory (~/Tmp/562.4) containing
the data encoded in the GenBank entry from the file <i>genbank.entry.for.a.new.E.coli.genome</i>.
This script is far from perfect, and there is huge variance in encodings in GenBank 
files.  So, use it at your own risk (and manually check the output).
<p>
You would be well advised to look at some of the subdirectories included in the FIGdisk/FIG/Data/Organisms directory 
to see examples of how it should be done.
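<p>
Before calling <tt>add_genome</tt>, it can help to check the directory layout
automatically.  The sketch below is a hypothetical helper, not a SEED tool.  It
assumes that every file in the table above that is not marked optional is
required, that the file names are spelled exactly as in the table, and that a
<i>peg</i> or <i>rna</i> subdirectory, when present, should hold both
<tt>tbl</tt> and <tt>fasta</tt>.
</p>

```python
import re
from pathlib import Path

# Files from the layout table that are not marked "(optional)" (assumed required).
REQUIRED_FILES = ("PROJECT", "GENOME", "TAXONOMY", "CONTIGS")


def check_genome_dir(path):
    """Return a list of problems found in a genome directory (empty if none)."""
    root = Path(path)
    problems = []
    if not re.fullmatch(r"\d+\.\d+", root.name):
        problems.append(f"directory name {root.name!r} is not of the form xxxx.y")
    for name in REQUIRED_FILES:
        if not (root / name).is_file():
            problems.append(f"missing file {name}")
    # peg and rna are optional, but if present each should hold tbl and fasta.
    for kind in ("peg", "rna"):
        feature_dir = root / "Features" / kind
        if feature_dir.is_dir():
            for name in ("tbl", "fasta"):
                if not (feature_dir / name).is_file():
                    problems.append(f"missing file Features/{kind}/{name}")
    return problems
```

<p>
An empty list means the layout looks plausible; it says nothing about whether
the contents of the files are correct.
</p>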
<p>
So, supposing that you have built a valid directory (say, <i>/Users/fig/Tmp/562.4</i>), you can add the genome using
<pre>
	fig add_genome /Users/fig/Tmp/562.4
</pre>
<br>
The <i>add_genome</i> request will add your new genome and queue a computational request that similarities
be computed for the protein-encoding genes.

<h2 id="importing_external">Importing External Protein Data</h2>

The presence of external judgements about the possible functions of encoded proteins
is one of the essential aspects of the SEED.  It becomes important that one be able to
add new sources of annotation, as well as periodically updating the judgements of
existing sources.  To update the external sets of proteins and annotations, build a new nonredundant 
database of proteins, and compute the associated similarities, one should proceed as follows:

<ol>
<li> Stop using the system until this procedure completes.
<br><br>
<li> Update the NR Directory
<br><br>
The <b>NR</b> directory is located within the <b>Data</b> directory:
<br>
<pre>
	~fig				          on a Mac: /Users/fig; on Linux: /home/fig
		FIGdisk
			dist                      source code
			FIG
				Tmp               temporary files
				Data              data in readable form
				          NR	  Contains external Data
					  
</pre>

The <b>NR</b> directory contains one subdirectory for each source of external
assignments (the released SEED includes subdirectories for SwissProt, NCBI, UniProt, and KEGG). 
You may add more subdirectories.
<p>
Each subdirectory must include 3 files:
<ol>
<li> <b>fasta</b> should be a fasta file containing the protein sequences.  These sequences will
be used to establish a correspondence between these IDs and other protein sequences within the SEED.
<br><br>
<li> <b>org.table</b> is a two-column, tab-separated table.  Column 1 is the ID, and column 2 is the 
organism corresponding to the ID.
<br><br>
<li> <b>assign_functions</b> is a 2-column table.  The ID is in column 1, and column 2 contains the
gene function (often called a <i>product name</i>) asserted by the external source.
</ol>
<br>
You should proceed only when you have updated as many of the sources as you wish.
<br><br>
<li> Now run
<pre>
       import_external_sequences_step1
</pre>

This program will build a new nonredundant database, check to see what has changed, and will
build the input required to compute new similarities.
<br><br>
<li> Compute the needed similarities

You will need three files to compute a new batch of similarities.  The locations of these
three files are displayed by <b>import_external_sequences_step1</b> just before completion
(i.e., you should have gotten them as the output of the last step).  Compute the similarities (see
the discussion below) and store them in the <b>NewSims</b> directory (again the precise location
was displayed by <b>import_external_sequences_step1</b>).
<br><br>
<li> Run
<pre>
       import_external_sequences_step3
</pre>
</ol>
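<p>
Because the three files in each <b>NR</b> subdirectory are tied together by
their IDs, a quick consistency check before running
<b>import_external_sequences_step1</b> may save time.  The following is a
sketch under assumptions: the helper names are hypothetical, the sequence ID is
taken to be the first word after "&gt;" in each fasta header, and the text does
not state that SEED itself requires every table ID to have a sequence.
</p>

```python
from pathlib import Path


def fasta_ids(path):
    """Return the set of sequence IDs (first word after '>') in a fasta file."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                parts = line[1:].split()
                if parts:
                    ids.add(parts[0])
    return ids


def table_ids(path):
    """Return the set of IDs in column 1 of a tab-separated table."""
    with open(path) as handle:
        return {line.split("\t")[0].strip() for line in handle if line.strip()}


def check_nr_source(source_dir):
    """Report IDs in org.table or assign_functions that have no fasta entry."""
    source = Path(source_dir)
    known = fasta_ids(source / "fasta")
    report = {}
    for name in ("org.table", "assign_functions"):
        report[name] = sorted(table_ids(source / name) - known)
    return report
```

<p>
Two empty lists in the result mean that every ID mentioned in the tables also
appears in the fasta file.
</p>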

<h2 id="sims">Computing Similarities</h2>

Adding a genome does not by itself compute similarities for the new genome.
You need to compute them and make them available in 
the <b>FIGdisk/FIG/Data/NewSims</b> directory.
<p>
To compute similarities, you will need to do the following:
<ol>
<li>The translations of the set of PEGs in your new genome (i.e., genome 562.4) should be in
<b>~fig/FIGdisk/FIG/Data/Organisms/562.4/Features/peg/fasta</b>.  A copy of this was appended to
<b>~fig/FIGdisk/FIG/Data/Global/nr</b> when your genome was added.  <b>nr</b> is the "nonredundant database"
we use to compute similarities (and the one you must use).  To get the initial blast results, you would use something 
like
<br>
<pre>
          blastall -i ~fig/FIGdisk/FIG/Data/Organisms/562.4/Features/peg/fasta -d ~fig/FIGdisk/FIG/Data/Global/nr -m 8 -FF -p blastp | reduce_sims ~fig/FIGdisk/FIG/Data/Global/peg.synonyms 300 > reduced.sims
</pre>
<br>
which produces the blast results in a tab-separated format.  The invocation of <b>reduce_sims</b> is optional.
It limits the retained similarities for each PEG to 300, using a truncation approach that attempts to preserve at least one similarity against each other genome (i.e., the trimming is selective).
<li>
The output of blastall lacks 2 columns that we need: the lengths of the two similar sequences.  To add them, you would use
<br>
<pre>
        reformat_sims ~fig/FIGdisk/FIG/Data/Global/nr < reduced.sims > ~fig/FIGdisk/FIG/Data/NewSims/sims.for.562.4
</pre>
<br>
This will actually append two columns to each similarity and place the results in the <b>NewSims</b>
directory where it should be.
</ol>
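<p>
To make the "two extra columns" concrete, here is a minimal Python sketch of
the idea.  It is not the actual <b>reformat_sims</b> program: the function
names are hypothetical, and the order of the appended columns (length of the
first sequence, then length of the second) is an assumption.  The input is the
12-column tab-separated output of <tt>blastall -m 8</tt>.
</p>

```python
def fasta_lengths(lines):
    """Map each sequence ID in fasta-formatted lines to its residue count."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else None
            if current is not None:
                lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def append_lengths(sim_lines, lengths):
    """Yield each tab-separated blast line with two length columns appended."""
    for line in sim_lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        id1, id2 = fields[0], fields[1]
        # Assumed order: length of id1, then length of id2.
        fields += [str(lengths[id1]), str(lengths[id2])]
        yield "\t".join(fields)
```

<p>
In practice the lengths would come from the <b>nr</b> file, since both Ids of
every similarity must be found there.
</p>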
<p>
The above description will produce similarities using a single invocation of
blastall.  For most large genomes, and whenever you wish to process a batch of genomes, 
you should use parallel processing while maintaining the spirit of the approach.
No matter how you produce the new similarities, they need to be added
as a file in the <b>FIGdisk/FIG/Data/NewSims</b> directory.  Then, you
need to index these similarities using
<pre>
	index_sims ~/FIGdisk/FIG/Data/NewSims/XXXX
</pre>
where XXXX is the file you added.  If you have more than one such
file, just put in several arguments for the command.  This will
"index" the similarities in that any of the new PEGs which have
similarities connecting them to other PEGs from the existing genomes
can now be displayed.  However, the connection from the existing
genomes to the new PEGs does not yet exist (we call these the "flips"
of the computed sims).  To get this ability, you need to go through a
process that will make your system unavailable for a period (and, it
will produce a substantial load on your system for a day or so, while
the SEED sorts, sifts, inserts, and generally plays with the "flips").
<br>
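<p>
A "flip" is simply the same similarity seen from the other sequence.  The
sketch below illustrates this for a single line; it is not the code used by
<b>update_sims</b>, and the 14-column layout it assumes (the 12 columns of
<tt>blastall -m 8</tt> followed by the two length columns) is an assumption
about the file format.
</p>

```python
def flip_sim(line):
    """Return the 'flip' of one tab-separated similarity line.

    Assumed column order (blastall -m 8 plus two length columns):
    id1 id2 identity aln_len mismatches gaps b1 e1 b2 e2 evalue score len1 len2
    """
    f = line.rstrip("\n").split("\t")
    if len(f) != 14:
        raise ValueError(f"expected 14 columns, got {len(f)}")
    flipped = (
        [f[1], f[0]]                  # swap the two Ids
        + f[2:6]                      # identity, length, mismatches, gaps
        + [f[8], f[9], f[6], f[7]]    # swap the two coordinate pairs
        + f[10:12]                    # e-value and score are symmetric here
        + [f[13], f[12]]              # swap the two sequence lengths
    )
    return "\t".join(flipped)
```

<p>
Flipping a line twice gives back the original line, which is a convenient
sanity check.
</p>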
The extra steps you need to take to make a fully functional version
are as follows:
<ol>
<li>
First, you need to run 
<pre>
	update_sims ~/FIGdisk/FIG/Data/Global/peg.synonyms 300 ~/FIGdisk/FIG/Data/Sims ~/Tmp/FlippedSims ~/FIGdisk/FIG/Data/NewSims/*
</pre>
This should produce updated similarity files in a VERY BIG directory
that we happened to put at <i>~/Tmp/FlippedSims</i> (but, which you could
put anywhere).  This may run as much as a day or so (and you can watch
its progress as it updates the similarity files). 
<li>The next step is to replace the existing similarity files with the
newly computed ones.  First, make the SEED unavailable (via the
<b>SEED Control Panel</b>).
<li>Then, blow away the existing similarities using something like
<pre>
	rm ~/FIGdisk/FIG/Data/Sims/*
	rm ~/FIGdisk/FIG/Data/NewSims/*
	cp ~/Tmp/FlippedSims/* ~/FIGdisk/FIG/Data/Sims
	rm -r ~/Tmp/FlippedSims
</pre>
There are several ways to do this.  You might want to save the old
similarities somewhere.  You might be able to move (rather than copy)
the similarities.  Use whatever suits you.
<li> Then run
<pre>
	index_sims
</pre>
to re-index all of the similarities, and you should be fully
operational.
</ol>
<br>

<h2 id="deleting_genomes">Deleting Genomes from a Version of the SEED</h2>

There are two common instances in which one wishes to delete genomes from a running version of the SEED: one is
when you wish to replace an existing version of a genome (in which case the replacement is viewed as first
deleting the existing copy and then adding the new copy), and the second is when you wish to create a copy
of the SEED containing a subset of the entire collection of genomes.
<p>
To delete a set of genomes from a running version of the SEED, just use
<pre>
	fig mark_deleted_genomes User G1 G2 ...Gn  (where G1 G2 ... Gn designates a list of genomes)
</pre>
For example,
<pre>
	fig mark_deleted_genomes RossO 562.1
</pre>
could be used to delete a single genome with a genome ID of 562.1.

 <h2 id="reintegrate_sims">Periodic Reintegration of Similarities</h2>

When the initial SEED was constructed, similarities were computed.  For most similarities of the form 
"Id1 and Id2 are similar", entries were "recorded" for both Id1 and Id2.  This is not always true,
since we truncate the number of similarities associated with any single Id (leaving us in a situation
in which we may have similarity recorded for Id1, but not Id2).  When a genome is added, if Id1 was an added
protein-encoding gene (peg), then the similarity is "recorded" for Id1 but not Id2.  This means that when looking
at genes from previously existing organisms, you never get links back to the added pegs.  This is not totally
satisfactory.
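<p>
The effect of truncation can be seen in a toy example.  The Python sketch below
is only an illustration with hypothetical names; it keeps the best-scoring
similarities per Id, whereas the SEED's own trimming is selective by genome.
</p>

```python
from collections import defaultdict


def record_sims(pairs, limit):
    """Record each (id1, id2, score) under both Ids, keeping the best `limit` per Id."""
    recorded = defaultdict(list)
    for id1, id2, score in pairs:
        recorded[id1].append((score, id2))
        recorded[id2].append((score, id1))
    return {
        ident: [other for _, other in sorted(hits, reverse=True)[:limit]]
        for ident, hits in recorded.items()
    }


def one_sided(recorded):
    """Return (a, b) pairs where b is recorded for a but a is not recorded for b."""
    return sorted(
        (a, b)
        for a, hits in recorded.items()
        for b in hits
        if a not in recorded.get(b, [])
    )
```

<p>
With three similarities to the same Id X and a limit of two, the weakest
partner still lists X, but X no longer lists that partner.
</p>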
<p>
Periodically, it is probably a good idea to "reintegrate the similarities".  This can be done by
just running
<pre>
        reintegrate_sims
#	update_sims /dev/null /dev/null ~/FIGdisk/FIG/Data/NewSims/* ; rm -f ~/FIGdisk/FIG/Data/NewSims/* index_sims
</pre>
The job will probably run for quite a while (perhaps as much as a day or two).  

<h2 id="pins_and_clusters">Computing "Pins" and "Clusters"</h2>

The SEED displays potentially significant clusters on prokaryotic chromosomes.  In the
process of finding preserved contiguity, it computes "pins", which are simply a set of genes
that are believed to be orthologs that cluster with similar genes.  If you add your own genome,
you will probably want to compute and enter these into the active database.  This can be done
using
<pre>
	compute_pins_and_clusters G1 G2 G3 ...
</pre>
where the arguments are genome Ids.  Thus,
<pre>
	compute_pins_and_clusters 562.4
</pre>
would compute and add entries for all of the <i>pegs</i> in genome 562.4.

<h2 id="auto_annotation">
   Automatic Annotation of Genomes
</h2>
The SEED provides a simple but limited capability for automated assignment 
of protein-encoding gene function based on similarity. 
Candidate functions are assigned scores based on the combined strengths 
of all BLASTP similarities to genes carrying that particular assignment, 
weighted by the provenance and assignment-confidence for each similar gene.
The final automated function assignment is then determined from the
list of candidate functions and their associated scores.
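<p>
As a rough illustration of this kind of weighted scoring (and not the actual
<tt>auto_assign</tt> algorithm), consider the Python sketch below.  The weight
values, the function names, and the choice of simply multiplying the BLASTP
score by the weights are all assumptions; only the confidence codes are taken
from the assigned_functions format described earlier.
</p>

```python
from collections import defaultdict

# Hypothetical weights per confidence code; the real values are not given here.
CONFIDENCE_WEIGHT = {" ": 1.0, "w": 0.5, "s": 1.5, "e": 2.0}


def score_candidates(hits):
    """Sum weighted similarity scores per candidate function.

    `hits` is an iterable of (function, bit_score, source_weight, confidence),
    where source_weight stands in for the provenance of the assignment.
    """
    totals = defaultdict(float)
    for function, bit_score, source_weight, confidence in hits:
        totals[function] += bit_score * source_weight * CONFIDENCE_WEIGHT[confidence]
    return dict(totals)


def best_function(hits):
    """Return the highest-scoring candidate function, or None if there are no hits."""
    totals = score_candidates(hits)
    if not totals:
        return None
    return max(totals, key=totals.get)
```

<p>
Under these assumed weights, two moderate hits that agree on a function can
outweigh a single stronger hit with a weak assignment.
</p>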

Automated assignment is a four-step process:
<ol>
<li> Create a list of PEGs to be automatically assigned. 
If one wishes to make assignments to an entire organism or set of organisms
that are already installed in the SEED, the simplest method for creating 
this list is to type the following command:
<pre>
    pegs Genome1 Genome2 Genome3 ... > ~/Tmp/peg.list
</pre>

<p>
<li> Next, create a list of candidate function-assignments using the following 
command:
<pre>
   auto_assign < ~/Tmp/peg.list > ~/Tmp/candidate.funcs
</pre>
(NOTE: The `auto_assign` command has some additional optional parameters;
for example, if one knows that all the PEGs in 'peg.list' are from 
prokaryotic organisms, one can make use of this additional information
by invoking `auto_assign` as follows:
<pre>
   auto_assign prokaryote < ~/Tmp/peg.list > ~/Tmp/candidate.funcs
</pre>
Also, if one wishes to use an alternate file of similarity data named 'simfile' 
instead of the precomputed similarities stored in the SEED, one can instead type:
<pre>
   auto_assign sims=simfile < ~/Tmp/peg.list > ~/Tmp/candidate.funcs
</pre>
Finally, `auto_assign` can read a set of alternate parameters from a file, 
but we recommend that you stick with the default settings, and not exploit this
last feature unless you are a qualified SEED wizard.)
<p>

<li> Next, create a SEED format assigned-functions file as follows:
<pre>
    make_calls < ~/Tmp/candidate.funcs > ~/Tmp/assigned_functions
</pre>
Alternatively, if you wish to suppress the class of "non-informative" function assignments
such as "Hypothetical protein," "Unclassified protein," "predicted gene," etc.,
you may do so using the '-no_hypos' flag:
<pre>
    make_calls -no_hypos < ~/Tmp/candidate.funcs > ~/Tmp/assigned_functions
</pre>

<li> Finally, install the automated assignments in the SEED using the command
<pre>
    fig assign_functionF master:automated_assignments  ~/Tmp/assigned_functions
</pre>

</ol>

It should be once again noted that the SEED's automated assignment algorithm 
is quite simple and crude, being only slightly better than simply assigning
the function of the highest-scoring BLASTP hit; however, it at least provides
a "quick and dirty" starting point for making an initial assessment of a genome,
which may then be cleaned up and refined by skilled genome annotators.






