Revision 1.5, Wed Jul 21 20:25:42 2004 UTC, by overbeek
Branch: MAIN
Changes since 1.4: +6 -6 lines
mods to SEED administration guide

<h1>SEED Administration</h1>
<p>This tutorial discusses a number of issues that you will need to know about
  in order to install, share, and maintain your SEED installation.</p>
<h2>Backing Up Your Data</h2>
The data and code stored within the SEED are organized as follows:
<pre>
	~fig				     on a Mac: /Users/fig; on Linux: /home/fig
		FIGdisk
			dist                 source code
			FIG
				Tmp          temporary files
				Data         data in readable form
</pre>
<ol><li>
The directory <b>FIGdisk</b> holds both the code and data for the
SEED.  The data is loaded into a database system that stores the data
in a location external to FIGdisk, but otherwise a running SEED is
encapsulated within FIGdisk.  A symbolic link to FIGdisk is maintained 
in the directory ~fig.
<br>
<li>
Within FIGdisk there are two key directories:
<br>
<br><ol><li>
<b>dist</b> contains the source code, and

<li>
<b>FIG</b> contains the execution environment and Data.
</ol>
<br>
<li>
Within FIG, there are a number of directories.  The most important are
<br>
<br>
<ol>
<li>
<b>Data</b>, which contains all of the data in a human-readable form,
and
<br>
<br>
<li>
<b>Tmp</b>, which contains the temporary files built by SEED in
response to commands.
</ol>
</ol>
<br>
Hence, to back up your data, you should simply copy the Data
directory to a separate disk.  Suppose that
/Volumes/Backup is a backup disk.  Then,
<br>
<pre>
	cp -pRP /Users/fig/FIGdisk/FIG/Data /Volumes/Backup/Data.Backup
	gzip -r /Volumes/Backup/Data.Backup
</pre>
<br>
would be a reasonable way to make a backup.  The copy preserves
permissions, copies recursively, and does not follow symbolic links.
<br>
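<p>The same backup can be scripted. The following is a minimal sketch, not part of the SEED: the function name <tt>backup_data</tt> and the date-stamped destination name are assumptions made for illustration. It copies the Data directory without following symbolic links and refuses to overwrite an existing backup. Compression is left to a separate step.</p>

```python
import shutil
import time
from pathlib import Path


def backup_data(data_dir, backup_root, stamp=None):
    """Copy data_dir to backup_root/Data.Backup.<stamp> and return the new path.

    Hypothetical helper, not a SEED command.
    """
    stamp = stamp or time.strftime("%Y-%m-%d")
    dest = Path(backup_root) / f"Data.Backup.{stamp}"
    # symlinks=True copies links as links (like cp -P). copytree uses
    # shutil.copy2 by default, which keeps permission bits and timestamps
    # (like cp -p). It raises FileExistsError if dest already exists.
    shutil.copytree(data_dir, dest, symlinks=True)
    return dest

# Example call (paths taken from the text above):
# backup_data("/Users/fig/FIGdisk/FIG/Data", "/Volumes/Backup")
```

<p>Stamping the destination with the date keeps successive backups side by side, so an older copy is never silently replaced.</p>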
<h2>Copying a Version of the SEED</h2>

To make a second copy of the SEED (either for a friend or for yourself), use tar, which
preserves the few symbolic links inside FIGdisk.  These links are relative, not absolute, so they
remain valid after the copy and the integrity of the whole system is preserved.
So, suppose that you have a FIGdisk in /Volumes/From/FIGdisk.Jan8 and you wish to copy it
to /Volumes/To.  Use 
<pre>
   cd /Volumes/From
   tar cf - FIGdisk.Jan8 | (cd /Volumes/To; tar xf -)
</pre>
<p>This should produce the desired copy.  Suppose that we are in a
  Mac OS X
  environment, and <b>From</b> and <b>To</b> are FireWire disks.  To install the system on a friend's
  Mac, you would unmount <b>To</b>, plug it into the new machine, and then set the symbolic link to the active
  FIGdisk using
  <br>
</p>
<table border="1" bgcolor="#CCCCCC">
  <tr>
    <td width="403"><font face="Courier New, Courier, mono">cd ~fig</font></td>
    <td width="285">&nbsp;</td>
  </tr>
  <tr>
    <td><font face="Courier New, Courier, mono">rm FIGdisk</font></td>
    <td># fails if there is no existing FIGdisk on the machine</td>
  </tr>
  <tr>
    <td><font face="Courier New, Courier, mono">ln -s /Volumes/To/FIGdisk.Jan8 FIGdisk</font></td>
    <td>&nbsp;</td>
  </tr>
  <tr>
    <td><font face="Courier New, Courier, mono">bash</font></td>
    <td># Switch to using the bash shell</td>
  </tr>
  <tr>
    <td><font face="Courier New, Courier, mono">cd FIGdisk</font></td>
    <td>&nbsp;</td>
  </tr>
  <tr>
    <td height="23"><font face="Courier New, Courier, mono">cp CURRENT_RELEASE DEFAULT_RELEASE</font></td>
    <td># Causes the new configuration to use the code that was running in the
      original installation</td>
  </tr>
  <tr>
    <td height="23"><font face="Courier New, Courier, mono">./configure <em>arch-name</em></font></td>
    <td># Configure the new SEED disk for architecture <em>arch-name</em>. </td>
  </tr>
  <tr>
    <td height="23"><font face="Courier New, Courier, mono"> source config/fig-user-env.sh <br>
    </font></td>
    <td># Set up the environment for using the SEED</td>
  </tr>
  <tr>
    <td height="23"><font face="Courier New, Courier, mono">start-servers <br>
    </font></td>
    <td># Start the database server and registration servers</td>
  </tr>
  <tr>
    <td height="23"><font face="Courier New, Courier, mono">init_FIG <br>
    </font></td>
    <td># Initialize a new relational database</td>
  </tr>
  <tr>
    <td height="23"><font face="Courier New, Courier, mono">fig load_all</font></td>
    <td># Load the database from the SEED data files. This may take several hours</td>
  </tr>
</table>
<p>At this point, the new SEED copy should be ready to use. You only need to
  perform the configure, init_FIG, and fig load_all steps once after installing
  a new copy of the SEED. After a reboot or other clean start of the computer,
  you will only have to do these steps:</p>
<table border="1" bgcolor="#EEEEEE">
  <tr>
    <td width="403"><font face="Courier New, Courier, mono">cd ~fig/FIGdisk</font></td>
    <td width="285">&nbsp;</td>
  </tr>
  <tr>
    <td><font face="Courier New, Courier, mono">bash</font></td>
    <td># Switch to using the bash shell</td>
  </tr>
  <tr>
    <td height="23"><font face="Courier New, Courier, mono"> source config/fig-user-env.sh <br>
    </font></td>
    <td># Set up the environment for using the SEED</td>
  </tr>
  <tr>
    <td height="23"><font face="Courier New, Courier, mono">start-servers <br>
    </font></td>
    <td># Start the database server and registration servers</td>
  </tr>
</table>
<p>Upon setting up a new computer for running SEED, you should read the full
  documentation for SEED installation, as it has a number of platform-specific
  modifications that need to be performed. This document can currently be found
  at the following
location in the SEED Wiki:  </p>
<blockquote>
  <p><a href="http://www-unix.mcs.anl.gov/SEEDWiki/moin.cgi/SeedInstallationInstructions">	http://www-unix.mcs.anl.gov/SEEDWiki/moin.cgi/SeedInstallationInstructions</a></p>
</blockquote>
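<p>Copying a FIGdisk as described in this section relies on its symbolic links being relative. A quick way to confirm this before copying is to list any links with absolute targets. This is a minimal sketch; the function name <tt>absolute_symlinks</tt> is hypothetical and not part of the SEED.</p>

```python
import os


def absolute_symlinks(root):
    """Return sorted (path, target) pairs for symlinks under root
    whose target is an absolute path."""
    found = []
    # os.walk does not descend into symlinked directories by default,
    # but it still lists them in dirnames, so they are checked here.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                target = os.readlink(path)
                if os.path.isabs(target):
                    found.append((path, target))
    return sorted(found)

# Example call (path taken from the text above):
# absolute_symlinks("/Volumes/From/FIGdisk.Jan8")
```

<p>An empty result suggests the tree can be moved to another disk without breaking its internal links.</p>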
<h2>Running Multiple Copies of the SEED</h2>

For individual users who use the SEED to support comparative analysis, a single copy is completely
adequate.  Adding genomes can usually be done without disrupting normal use, and an occasional major 
reorganization that runs over a weekend is not a big deal.  
<p>
The situation is somewhat different when the system is being used to support a major sequencing/annotation
effort.  Here you have a user community that is sensitive to disruptions of service, and you
face frequent demands to update versions of data.  It is then best to have two systems: the 
<b>production system</b> is used to support the larger user community, and the <b>update system</b> is
used to prepare updated versions of the system.  Even so, work stoppages of 4-8 hours will occur when 
new releases are swapped in.  To swap in new data from the update system to the production system,
you need to do the following:
<ol>
<li>Stop all work on the production machine by clicking on the "Seed Control Panel" link,
entering an explanatory message in the text box, and clicking on the "Disable SEED server" button.
<li>You now need to capture the assignments, annotations and
subsystems work that has been done on the production machine.  
To do this, you need to know when the last production release 
was installed.  Suppose that it was July 1, 2004.  
If that was the date, we recommend that you run<br><br>
<pre>
    <b>extract_data_for_syncing_after_update 7/1/2004 /tmp/sync.data.july.1.2004</b>
</pre>
<br><br>
This will capture your updates and save them in the directory
/tmp/sync.data.july.1.2004.
<li>Now, you need to replace your <b>Data</b> directory (within
<b>FIGdisk/FIG</b>) with the new version from the update system.  We
suggest that you do the following:
<ol>
<li>Archive the existing <b>Data</b> directory.  Such archives can usually be
discarded within a month or two, but keeping them around is a good
safety measure.
<li>Move a copy of the update <b>Data</b> directory into the
<b>FIGdisk/FIG</b> directory.
</ol>
At this point, you have a version of the data from the update system
in the right location, but the internal databases all contain the old data.
<li> Now, run
<pre>
	<b>fig load_all</b>
</pre>
to reload the production databases with the data from the newly inserted Data directory.
This will usually take several hours.
<li>Now, you need to apply the changes captured from the old production
version, using something like
<br>
<pre>
	<b>sync_new_system /tmp/sync.data.july.1.2004 make-assignments</b>
</pre>
<br>
<li> Make the production machine available for use again.
<li>You should now bring your update system to the same state as the
production system.  This can be done by making sure that
<b>/tmp/sync.data.july.1.2004</b> is accessible to the update system.
If the production and update systems are run on the same machine, then
the directory is already there.  If not, copy it to <b>/tmp</b> on the
update machine.  Then run
<br>
<pre>
	<b>sync_new_system /tmp/sync.data.july.1.2004 make-assignments</b>
</pre>
<br>
on the update machine.
</ol>
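<p>The order of the commands in the procedure above is easy to get wrong, so it can help to generate them from the date of the last production release. The sketch below only builds the command lines and runs nothing. The function name <tt>swap_plan</tt> is hypothetical, and the month/day/year date format and the directory naming are assumptions inferred from the July 1, 2004 example.</p>

```python
from datetime import date

MONTHS = ("january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december")


def swap_plan(last_release, tmp_root="/tmp"):
    """Return (machine, argv) pairs for the swap procedure, in order.

    The manual steps (disabling the server, archiving and replacing the
    Data directory, re-enabling the server) are not represented here.
    """
    # Assumed format: month/day/year without zero padding, as in 7/1/2004.
    stamp = f"{last_release.month}/{last_release.day}/{last_release.year}"
    name = f"sync.data.{MONTHS[last_release.month - 1]}.{last_release.day}.{last_release.year}"
    sync_dir = f"{tmp_root}/{name}"
    return [
        ("production", ["extract_data_for_syncing_after_update", stamp, sync_dir]),
        # The Data directory is replaced by hand between these two commands.
        ("production", ["fig", "load_all"]),
        ("production", ["sync_new_system", sync_dir, "make-assignments"]),
        ("update", ["sync_new_system", sync_dir, "make-assignments"]),
    ]


# Print the plan for the example date used in the text.
for machine, argv in swap_plan(date(2004, 7, 1)):
    print(machine, " ".join(argv))
```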
Our experience is that any time a group wishes to share a common production environment,
this two-system approach is the way to do it.  You can, if necessary,
put both systems on the same physical machine.  This does require some
special handling in setting up two different <b>FIGdisk</b>
directories.  We recommend using <b>FIGdisk.production</b> and
<b>FIGdisk.update</b>.  However, in general it makes sense to use two
separate physical machines, for backup if nothing else.  The update
system can usually be run on a $2000 (or less) box, although it is
desirable to spend a little more and get at least 1 gigabyte of main
memory and 200 gigabytes of external disk.
<br>
<h2>Adding a New Genome to an Existing SEED</h2>
Adding a new genome to a running SEED is fairly easy, but there are a
number of details that have to be handled with care.  
<p>
The first thing to note is that the SEED does not include tools to call genes -- you are expected
to provide gene calls.  This may change at some point, but for now you must call your own genes.  A
number of good tools now exist in the public domain, and you will need to find one that seems adequate
for your needs.
<p>
Let us now
cover how to prepare the actual data.  You need to construct a directory (somewhere like ~fig/Tmp)
of the following form:
<br>
<table width="100%">
<tr>
<td><tt>GenomeId</tt></td>
<td></td>
<td></td>
<td></td>
<td>of the form xxxx.y where xxxx is the taxon ID and y is an integer</td>
</tr>

<tr>
<td></td>
<td><tt>PROJECT</tt></td>
<td></td>
<td></td>
<td> a file containing a description of the source of the data</td>
</tr>

<tr>
<td></td>
<td><tt>GENOME</tt></td>
<td></td>
<td></td>
<td>a file containing a single line identifying the genus, species and strain</td>
</tr>

<tr>
<td></td>
<td><tt>TAXONOMY</tt></td>
<td></td>
<td></td>
<td>a file containing a single line containing the NCBI taxonomy</td>
</tr>

<tr>
<td></td>
<td><tt>RESTRICTIONS</tt></td>
<td></td>
<td></td>
<td>a file containing a description of distribution restrictions (optional)</td>
</tr>

<tr>
<td></td>
<td><tt>contigs</tt></td>
<td></td>
<td></td>
<td>contigs in fasta format</td>
</tr>

<tr>
<td></td>
<td><tt>assigned_functions</tt></td>
<td></td>
<td></td>
<td>function assignments for the protein-encoding genes (optional)</td>
</tr>

<tr>
<td></td>
<td><tt>Features</tt></td>
</tr>

<tr>
<td></td>
<td></td>
<td><tt>peg</tt></td>
</tr>

<tr>
<td></td>
<td></td>
<td></td>
<td><tt>tbl</tt></td>
<td>describes locations and aliases for the protein-encoding genes</td>
</tr>

<tr>
<td></td>
<td></td>
<td></td>
<td><tt>fasta</tt></td>
<td>fasta file of translations of the protein-encoding genes</td>
</tr>

<tr>
<td></td>
<td></td>
<td><tt>rna</tt></td>
</tr>

<tr>
<td></td>
<td></td>
<td></td>
<td><tt>tbl</tt></td>
<td>describes locations and aliases for the rna-encoding genes</td>
</tr>

<tr>
<td></td>
<td></td>
<td></td>
<td><tt>fasta</tt></td>
<td>fasta file of the DNA corresponding to the genes</td>
</tr>


</table>
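<p>Before handing such a directory to the SEED, it is worth checking that the expected files are present. The following is a minimal sketch under stated assumptions: the function name <tt>check_genome_dir</tt> is hypothetical, the files not marked optional are treated as required, the contigs file is accepted in either capitalisation, and a <tt>peg</tt> or <tt>rna</tt> subdirectory, when present, is expected to hold both <tt>tbl</tt> and <tt>fasta</tt>.</p>

```python
import re
from pathlib import Path

GENOME_ID = re.compile(r"^\d+\.\d+$")  # xxxx.y: taxon ID, dot, integer


def check_genome_dir(path):
    """Return a list of problems found in a genome directory (empty if none)."""
    d = Path(path)
    problems = []
    if not GENOME_ID.match(d.name):
        problems.append(f"directory name {d.name!r} is not of the form xxxx.y")
    for name in ("PROJECT", "GENOME", "TAXONOMY"):
        if not (d / name).is_file():
            problems.append(f"missing file {name}")
    # Assumption: accept either capitalisation of the contigs file name.
    if not any((d / n).is_file() for n in ("contigs", "CONTIGS")):
        problems.append("missing contigs file")
    # peg and rna are optional, but if present should be complete.
    for kind in ("peg", "rna"):
        sub = d / "Features" / kind
        if sub.is_dir():
            for name in ("tbl", "fasta"):
                if not (sub / name).is_file():
                    problems.append(f"missing file Features/{kind}/{name}")
    return problems
```

<p>This checks only that files exist; it does not validate their contents.</p>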

<br>
<br>
Let us expand on this very brief description:
<ol>
<li>
The name of the directory must be of the form xxxx.y where xxxx is the
taxon ID, and y is a sequence number.  For example, 562.1 might be
used for <i>E.coli</i>, since 562 is the NCBI taxon ID for
<i>Escherichia coli</i>.  The sequence number (y) is used to
distinguish multiple genomes having the same taxon ID. 
<br><br>
<li>
The assigned_functions file contains assignments of function for the
protein-encoding genes.  Each line is of the form
<pre>
		Id\tFunction\tConfidence  (\t stands for a tab character)
</pre>
The Id must be a valid PEG Id.  These are of the form:
<pre>
		fig|xxxx.y.peg.z
</pre>
where xxxx.y is the genome Id, and z is an integer that uniquely distinguishes
the peg (protein-encoding gene).
<br>
<i>Confidence</i> is a single character code: 
<br>
<ul>
<li>a space for "normal"
<li>w for "weak"
<li>e for experimentally verified
<li>s for "strong evidence (but not experimental)"
</ul>
The second tab and the confidence code can be omitted (it will default to a space).
The assigned_functions file is optional.  You can leave it blank and, after adding the genome
to the SEED, ask for automated assignments.
<br><br>
<li>
The tbl files specify the locations of genes, as well as any aliases.  Each line in a tbl file 
is of the form
<br>
<pre>
	Id\tLocation\tAliases    (the aliases are separated by tabs)
</pre>
The Id must conform to the fig|xxxx.y.peg.z format described above.  The <i>Location</i> is of the form
<br>
<pre>
	L1,L2,L3...Ln

where each Li describes a region on a contig and is of the form

	<i>Contig_Begin_End</i> where

	      Contig is the Id of the contig,
	      Begin is the position of the first character, and
	      End is the position of the last character
</pre>
<ul>
<li>if Begin > End, the region being described is on the complementary strand, and
<li>the End position is the last character preceding the stop codon (i.e., the region
corresponding to a protein-encoding gene is thought of as including all bases from the
first base of the start codon to the last base before the stop codon).
</ul>
For example,
<pre>
fig|562.1.peg.15	Escherichia_coli_K12_14168_15295	dnaJ	b0015	sp|P08622	gi|16128009 
</pre>
describes the <i>dnaJ</i> gene encoded on the positive strand from 14168 through 15295 on the contig Escherichia_coli_K12.
The gene is from the genome 562.1, and it has 4 specified aliases.
<li>
The fasta files must have gene Ids that match tbl file entries.  The <i>peg</i> fasta file contains translations,
while the <i>rna</i> fasta file contains DNA sequences.
<li>
Both the <i>peg</i> and the <i>rna</i> subdirectories are optional.
</ol>
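<p>The tbl line format and the <i>Location</i> syntax described above can be parsed in a few lines. This is a minimal sketch, not SEED code: the names <tt>Region</tt>, <tt>parse_location</tt> and <tt>parse_tbl_line</tt> are hypothetical. Because contig Ids may themselves contain underscores (as in the <i>dnaJ</i> example), each region is split from the right.</p>

```python
from typing import NamedTuple


class Region(NamedTuple):
    contig: str
    begin: int
    end: int

    @property
    def strand(self):
        # Begin > End means the region is on the complementary strand.
        return "-" if self.begin > self.end else "+"


def parse_location(location):
    """Split 'L1,L2,...,Ln' into Regions; each Li is Contig_Begin_End."""
    regions = []
    for part in location.split(","):
        # Split from the right so underscores in the contig Id survive.
        contig, begin, end = part.rsplit("_", 2)
        regions.append(Region(contig, int(begin), int(end)))
    return regions


def parse_tbl_line(line):
    """Return (feature_id, regions, aliases) for one tab-separated tbl line."""
    fields = line.rstrip("\n").split("\t")
    feature_id, location = fields[0], fields[1]
    # Any remaining fields are aliases; drop empty ones and stray spaces.
    aliases = [a.strip() for a in fields[2:] if a.strip()]
    return feature_id, parse_location(location), aliases
```

<p>Applied to the <i>dnaJ</i> line above, this yields one region on contig <tt>Escherichia_coli_K12</tt> from 14168 to 15295 and four aliases.</p>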
<br>
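<p>Lines of the assigned_functions file can be checked in the same spirit. The sketch below is illustrative only: <tt>parse_assignment</tt> is a hypothetical name, and the function text used in the usage comment is made up for the example.</p>

```python
import re

PEG_ID = re.compile(r"^fig\|\d+\.\d+\.peg\.\d+$")
CONFIDENCE = {
    " ": "normal",
    "w": "weak",
    "e": "experimentally verified",
    "s": "strong evidence (but not experimental)",
}


def parse_assignment(line):
    """Return (peg_id, function, confidence_code) for one line of the form
    Id<TAB>Function[<TAB>Confidence]."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError(f"expected at least Id and Function: {line!r}")
    peg_id, function = fields[0], fields[1]
    if not PEG_ID.match(peg_id):
        raise ValueError(f"not a valid PEG Id: {peg_id!r}")
    # The second tab and the code may be omitted; the default is a space.
    code = fields[2] if len(fields) > 2 and fields[2] else " "
    if code not in CONFIDENCE:
        raise ValueError(f"unknown confidence code {code!r}")
    return peg_id, function, code

# Example call with an illustrative function text:
# parse_assignment("fig|562.1.peg.15\tChaperone protein DnaJ\tw\n")
```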
The SEED provides a utility that can be used to produce such a directory from a GenBank entry.  Thus,
<br>
<pre>
	parse_genbank 562.4 ~/Tmp/562.4 < genbank.entry.for.a.new.E.coli.genome
</pre>
would attempt to produce a properly formatted directory (~/Tmp/562.4) containing
the data encoded in the GenBank entry from the file <i>genbank.entry.for.a.new.E.coli.genome</i>.
This script is far from perfect, and there is huge variance in encodings in GenBank 
files.  So, use it at your own risk and manually check the output.
<p>
You would be well advised to look at some of the subdirectories included in the FIGdisk/FIG/Data/Organisms directory 
to see examples of how it should be done.
<p>
So, supposing that you have built a valid directory (say, <i>/Users/fig/Tmp/562.4</i>), you can add the genome using
<pre>
	fig add_genome /Users/fig/Tmp/562.4
</pre>
<br>
The <i>add_genome</i> request will add your new genome and queue a computational request that similarities
be computed for the protein-encoding genes.

<h2>Computing Similarities</h2>

Adding a genome does not automatically get similarities computed for the new genome; it queues the request.
To get the similarities actually computed, you need to establish a computational environment on which
the blast runs will be made, and then initiate a request on the machine running the SEED.
<p>
This is not a completely trivial process because there are a variety of different ways to compute
similarities:
<ol>
<li> You can just compute them on the system running the SEED.  This can take several days, but this
is often a perfectly reasonable way to get the job done.
<li>Alternatively, you may be in an environment where you have a set of networked machines (say, 4-5 machines),
and you wish to just exploit these machines to do the blast runs.
<li> Finally, you may be dealing with a large genome or genomes (and, hence, the need for many days of computation).
In this case, it makes sense to utilize a large computational resource, and this resource may either
be a local cluster or a service provided over the net.
</ol>
<br>
To establish the flexibility needed to support all of these alternatives, we implemented the following
approach:
<ul>
<li>
The user can describe one or more <b>similarity computational environments</b> 
in a configuration file called <i>similarities.config</i>.  The details of this encoding
are beyond the scope of this document.
These environments all represent potential ways to compute similarities.
<br>
<li>
When a SEED systems administrator (usually, the normal SEED user) wishes to run similarities,
they run the <b>generate_similarities</b> command with two parameters: the first names a
similarity computational environment, and the second specifies whether or not automated assignments
should be computed as the similarity computations complete and the results are installed.
The command causes all the queued similarity requests to be batched up and sent off to the
specified server (which may simply be on the same machine).
As the similarities complete, they are automatically installed.  Further, if a set of similarities arrives
for a given protein-encoding gene, and there is no current assignment of function for the gene,
an automated assignment may be computed, depending on the second parameter.  For example,
<pre>
	generate_similarities local auto-assignments
</pre>
specifies a similarity computational environment labeled <i>local</i>, which presumably means "run the blast
requests on this machine", and requests automated assignments for all protein-encoding genes that currently either
have no assigned function or have an assigned function that is "hypothetical".
</ul>
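<p>The rule for which genes receive automated assignments can be written as a small predicate. This is a sketch of the rule as described above, not the SEED's own test: the name <tt>needs_auto_assignment</tt> is hypothetical, and treating any function text that mentions "hypothetical" as a match is an assumption.</p>

```python
def needs_auto_assignment(function):
    """True if a gene has no assigned function or a "hypothetical" one."""
    if function is None or not function.strip():
        return True
    # Assumption: any function text mentioning "hypothetical" counts.
    return "hypothetical" in function.lower()
```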
<br>

We anticipate that at least one major center (Argonne National Lab) and, perhaps, more will create well-defined
interfaces for handling high-volume requests.  At FIG, we will maintain a set of instructions on how to set up
your configuration to exploit these resources.

<h2>Deleting Genomes from a Version of the SEED </h2>

There are two common instances in which one wishes to delete genomes from a running version of the SEED: one is
when you wish to replace an existing version of a genome (in which case the replacement is viewed as first
deleting the existing copy and then adding the new copy), and the second is when you wish to create a copy
of the SEED containing a subset of the entire collection of genomes.
<p>
To delete a set of genomes from a running version of the SEED, just use
<pre>
	fig delete_genomes G1 G2 ...Gn  (where G1 G2 ... Gn designates a list of genomes)
</pre>
For example,
<pre>
	fig delete_genomes 562.1
</pre>
could be used to delete a single genome with a genome ID of 562.1.
<p>
To make a copy with some genomes deleted to give to someone else requires a slightly different approach.
To extract a set of genomes from an existing version of the SEED, you need to run the command
<pre>
	extract_genomes Which ExistingData ExtractedData
</pre>

The first argument is either the word "unrestricted" or the name of a file containing a list of
genome IDs (the genomes that are to be retained in the extraction).  The second argument is
the path to the current Data directory.  The third argument specifies the name of a directory
that is created holding the extraction.  Thus,
<pre>
	extract_genomes unrestricted /Users/fig/FIGdisk/FIG/Data /Volumes/Tmp/ExtractedData
</pre>
would create the extracted Data directory for you.  If you wish to then produce a fully distributable
version of the SEED from the existing version and the extracted Data directory, you would
use
<pre>
	make_a_SEED /Users/fig/FIGdisk /Volumes/Tmp/ExtractedData /Volumes/MyFriend/FIGdisk.ReadyToGo
	rm -rf /Volumes/Tmp/ExtractedData
</pre>
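<p>When the first argument to <tt>extract_genomes</tt> is a file of genome IDs, the file can be generated rather than typed by hand. The sketch below assumes one genome ID per line, which the text above does not specify; the function name <tt>write_genome_list</tt> is hypothetical.</p>

```python
import re
from pathlib import Path

GENOME_ID = re.compile(r"^\d+\.\d+$")  # xxxx.y: taxon ID, dot, integer


def write_genome_list(genome_ids, list_path):
    """Write validated genome IDs to list_path and return how many were written.

    Assumption: the list file holds one genome ID per line.
    """
    ids = []
    for gid in genome_ids:
        if not GENOME_ID.match(gid):
            raise ValueError(f"not a genome ID of the form xxxx.y: {gid!r}")
        if gid not in ids:  # keep the first occurrence, drop repeats
            ids.append(gid)
    Path(list_path).write_text("".join(f"{gid}\n" for gid in ids))
    return len(ids)
```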

<h2>Periodic Reintegration of Similarities</h2>

When the initial SEED was constructed, similarities were computed.  For most similarities of the form 
"Id1 and Id2 are similar", entries were "recorded" for both Id1 and Id2.  This is not always true,
since we truncate the number of similarities associated with any single Id (leaving us in a situation
in which we may have a similarity recorded for Id1, but not Id2).  When a genome is added, if Id1 is an added
protein-encoding gene (peg), then the similarity is "recorded" for Id1 but not Id2.  This means that when looking
at genes from previously existing organisms, you never get links back to the added pegs.  This is not totally
satisfactory.
<p>
Periodically, it is probably a good idea to "reintegrate the similarities".  This can be done by
just running
<pre>
        reintegrate_sims
#	update_sims /dev/null /dev/null ~/FIGdisk/FIG/Data/NewSims/* ; rm -f ~/FIGdisk/FIG/Data/NewSims/* index_sims
</pre>
The job will probably run for quite a while (perhaps as much as a day or two).  
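<p>The asymmetry described in this section can be illustrated with a toy model. This is not how <tt>reintegrate_sims</tt> is implemented; it is a sketch with hypothetical names and a made-up data layout (a dictionary from Id to a list of <tt>(score, other_id)</tt> entries). It shows how truncation leaves one-way entries, and what restoring the missing back links means.</p>

```python
from collections import defaultdict


def record_truncated(pairs, limit):
    """Record each (id1, id2, score) under both Ids, then keep only the
    best `limit` entries per Id."""
    sims = defaultdict(list)
    for id1, id2, score in pairs:
        sims[id1].append((score, id2))
        sims[id2].append((score, id1))
    return {k: sorted(v, reverse=True)[:limit] for k, v in sims.items()}


def missing_back_links(sims):
    """Return sorted (id1, id2) pairs where id2 is recorded for id1
    but id1 is not recorded for id2."""
    missing = []
    for id1, hits in sims.items():
        for _, id2 in hits:
            if all(other != id1 for _, other in sims.get(id2, [])):
                missing.append((id1, id2))
    return sorted(missing)


def reintegrate(sims):
    """Return a copy of sims with the missing reverse entries added."""
    out = {k: list(v) for k, v in sims.items()}
    for id1, id2 in missing_back_links(sims):
        score = next(s for s, other in sims[id1] if other == id2)
        out.setdefault(id2, []).append((score, id1))
    return out
```

<p>In this model, a gene whose list is already full never shows a weaker hit to a newly added gene until the reverse entries are restored. Whether the real tool re-applies a truncation limit afterwards is not covered by this sketch.</p>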

<h2>Computing "Pins" and "Clusters"</h2>

The SEED displays potentially significant clusters on prokaryotic chromosomes.  In the
process of finding preserved contiguity, it computes "pins", each of which is simply a set of genes
that are believed to be orthologs and that cluster with similar genes.  If you add your own genome,
you will probably want to compute and enter these into the active database.  This can be done
using
<pre>
	compute_pins_and_clusters G1 G2 G3 ...
</pre>
where the arguments are genome Ids.  Thus,
<pre>
	compute_pins_and_clusters 562.4
</pre>
would compute and add entries for all of the <i>pegs</i> in genome 562.4.
