SEED Administration

This tutorial discusses a number of issues that you will need to know about in order to install, share, and maintain your SEED installation.

Backing Up Your Data

The data and code stored within the SEED are organized as follows:
	~fig				     on a Mac: /Users/fig; on Linux: /home/fig
		FIGdisk
			dist                 source code
			FIG
				Tmp          temporary files
				Data         data in readable form
  1. The directory FIGdisk holds both the code and data for the SEED. The data is loaded into a database system that stores the data in a location external to FIGdisk, but otherwise a running SEED is encapsulated within FIGdisk. A symbolic link to FIGdisk is maintained in the directory ~fig.
  2. Within FIGdisk there are two key directories:

    1. dist contains the source code, and
    2. FIG contains the execution environment and Data.

  3. Within FIG, there are a number of directories. The most important are

    1. Data, which contains all of the data in a human-readable form, and

    2. Tmp, which contains the temporary files built by SEED in response to commands.

Hence, to back up your data, you should simply copy the Data directory. It should be backed up to a separate disk. Suppose that /Volumes/Backup is a backup disk. Then,
	cp -pRP /Users/fig/FIGdisk/FIG/Data /Volumes/Backup/Data.Backup
	gzip -r /Volumes/Backup/Data.Backup

would be a reasonable way to make a backup. The copy preserves permissions, copies recursively, and does not follow symbolic links.
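Restoring from such a backup is essentially the reverse copy. A minimal sketch, assuming the backup above, that the SEED servers are not running during the restore, and that the existing Data directory is first moved out of the way (after restoring, you will generally need to reload the database with fig load_all, described below):
	mv /Users/fig/FIGdisk/FIG/Data /Users/fig/FIGdisk/FIG/Data.old
	cp -pRP /Volumes/Backup/Data.Backup /Users/fig/FIGdisk/FIG/Data
	gzip -dr /Users/fig/FIGdisk/FIG/Data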

Copying a Version of the SEED

To make a second copy of the SEED (either for a friend or for yourself), you should use tar to preserve a few symbolic links (which are relative, not absolute; this means that they can be copied while still preserving the integrity of the whole system). So, suppose that you have a FIGdisk in /Volumes/From/FIGdisk.Jan8 and you wish to copy it to /Volumes/To. Use
   cd /Volumes/From
   tar cf - FIGdisk.Jan8 | (cd /Volumes/To; tar xf -)
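If the source and destination disks are attached to different machines, the same copy can be piped over the network; a sketch assuming ssh access to the destination host (the host name here is hypothetical):
   cd /Volumes/From
   tar cf - FIGdisk.Jan8 | ssh fig@destination-host "cd /Volumes/To && tar xf -"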

Either form should produce the desired copy. In this case, suppose that we are in a Mac OS X environment, and From and To are FireWire disks. To install the system on a friend's Mac, you would unmount To, plug it into the new machine, and then set the symbolic link to the active FIGdisk using

cd ~fig  
rm FIGdisk # fails if there is no existing FIGdisk on the machine
ln -s /Volumes/To/FIGdisk.Jan8 FIGdisk  
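ls -l FIGdisk # Optional sanity check (just a suggestion): confirm the link now points at /Volumes/To/FIGdisk.Jan8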
bash # Switch to using the bash shell
cd FIGdisk  
cp CURRENT_RELEASE DEFAULT_RELEASE # Causes the new configuration to use the code that was running in the original installation
./configure arch-name # Configure the new SEED disk for architecture arch-name.
source config/fig-user-env.sh # Set up the environment for using the SEED
start-servers # Start the database server and registration servers
init_FIG # Initialize a new relational database
fig load_all # Load the database from the SEED data files. This may take several hours

At this point, the new SEED copy should be ready to use. You only need to perform the configure, init_FIG, and fig load_all steps once after installing a new copy of the SEED. After a reboot or other clean start of the computer, you will only have to do these steps:

cd ~fig/FIGdisk  
bash # Switch to using the bash shell
source config/fig-user-env.sh # Set up the environment for using the SEED
start-servers # Start the database server and registration servers
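If you restart the machine often, these few steps can be wrapped in a small shell script; a minimal sketch (the script name and location are only a suggestion, not part of the SEED distribution):
	#!/bin/bash
	# restart-seed.sh -- hypothetical convenience wrapper around the documented restart steps
	cd ~fig/FIGdisk || exit 1
	source config/fig-user-env.sh   # Set up the environment for using the SEED
	start-servers                   # Start the database server and registration servers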

When setting up a new computer to run the SEED, you should read the full SEED installation documentation, as it describes a number of platform-specific modifications that need to be made. This document can currently be found at the following location in the SEED Wiki:

http://www-unix.mcs.anl.gov/SEEDWiki/moin.cgi/SeedInstallationInstructions

Running Multiple Copies of the SEED

For individual users who use the SEED to support comparative analysis, a single copy is completely adequate. Adding genomes can usually be done without disrupting normal use, and a very occasional major reorganization that runs over the weekend is not a big deal.

The situation is somewhat different when the system is being used to support a major sequencing/annotation effort. In that case, you have a user community that is sensitive to disruptions of service, and you have frequent demands to update versions of the data. Here, it is best to have two systems: the production system is used to support the larger user community, and the update system is used to prepare updated versions of the system. Even so, work stoppages of 4-8 hours will occur when new releases are swapped in. To swap in new data from the update system to the production system, you need to do the following (a consolidated shell sketch of these steps appears after the list):

  1. stop all work on the production machine,
  2. capture the assignments, annotations and subsystems work that has been done on the production machine. To do this, you need to know when the last production release was installed. Suppose that it was July 1, 2004; in that case, we recommend that you run

        extract_data_for_syncing_after_update 7/1/2004 /tmp/sync.data.july.1.2004
    


    This will capture your updates and save them in the directory /tmp/sync.data.july.1.2004.
  3. Now, you need to replace your Data directory (within FIGdisk/FIG) with the new version from the update system. We suggest that you do the following:
    1. archive the existing Data directory. Such archives can usually be discarded within a month or two, but keeping them around is a good safety measure.
    2. move a copy of the update Data directory into the FIGdisk/FIG directory.
    At this point, you have a version of the data from the update system in the right location, but the internal databases all contain the old data.
  4. Now, run
    	fig load_all
    
    to reload the production databases with the data from the newly inserted Data directory. This will usually take several hours.
  5. Now, you need to re-apply the changes captured from the old production version, using something like
    	sync_new_system /tmp/sync.data.july.1.2004 make-assignments
    

  6. make the production machine available for use.
  7. You should now bring your update system to the same state as the production system. This can be done by making sure that /tmp/sync.data.july.1.2004 is accessible to the update system. If the production and update systems are run on the same machine, then the directory is already there. If not, copy it to /tmp on the update machine. Then run
    	sync_new_system /tmp/sync.data.july.1.2004 make-assignments
    

    on the update machine.
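Put together, the whole swap might look roughly like the following. This is only a sketch: it assumes the update system's Data directory has been staged at /Volumes/Update/Data (a hypothetical path), and the dates and paths should be adjusted to your situation.
	# 1. Stop all work on the production machine (announce the downtime first).
	# 2. Capture the work done since the last release (July 1, 2004 in this example).
	extract_data_for_syncing_after_update 7/1/2004 /tmp/sync.data.july.1.2004
	# 3. Archive the old Data directory and move the staged update Data into place.
	cd /Users/fig/FIGdisk/FIG
	mv Data Data.archived.july.1.2004
	mv /Volumes/Update/Data Data
	# 4. Reload the production databases (usually several hours).
	fig load_all
	# 5. Re-apply the captured assignments, annotations, and subsystems work.
	sync_new_system /tmp/sync.data.july.1.2004 make-assignments
	# 6. Make the production machine available for use again.
	# 7. Finally, run the same sync_new_system command in the update system's environment.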
Our experience is that anytime a group wishes to share a common production environment, this 2-system approach is the way to do it. You can, if necessary, put both systems on the same physical machine. This does require some special handling in setting up two different FIGdisk directories. We recommend using FIGdisk.production and FIGdisk.update. However, in general it makes sense to use two separate physical machines, for backup if nothing else. The update system can usually be run on a $2000 (or less) box, although it is desirable to spend a little more and get at least 1 gigabyte of main memory and 200 gigabytes of external disk.

Adding a New Genome to an Existing SEED

Adding a new genome to a running SEED is fairly easy, but there are a number of details that have to be handled with care.

The first thing to note is that the SEED does not include tools to call genes -- you are expected to provide gene calls. This may change at some point, but for now you must call your own genes. A number of good tools now exist in the public domain, and you will need to find one that seems adequate for your needs.

Let us now cover how to prepare the actual data. You need to construct a directory (somewhere like ~fig/Tmp) of the following form:
	GenomeId                     the directory name, of the form xxxx.y, where xxxx is the taxon ID and y is an integer
		PROJECT              a file containing a description of the source of the data
		GENOME               a file containing a single line identifying the genus, species and strain
		TAXONOMY             a file containing a single line containing the NCBI taxonomy
		RESTRICTIONS         a file containing a description of distribution restrictions (optional)
		CONTIGS              contigs in fasta format
		assigned_functions   function assignments for the protein-encoding genes (optional)
		Features
			peg
				tbl      describes locations and aliases for the protein-encoding genes
				fasta    fasta file of translations of the protein-encoding genes
			rna
				tbl      describes locations and aliases for the rna-encoding genes
				fasta    fasta file of the DNA corresponding to the genes


Let us expand on this very brief description (a small shell sketch of building such a directory by hand follows these notes):

  1. The name of the directory must be of the form xxxx.y where xxxx is the taxon ID, and y is a sequence number. For example, 562.1 might be used for E.coli, since 562 is the NCBI taxon ID for Escherichia coli. The sequence number (y) is used to distinguish multiple genomes having the same taxon ID.

  2. The assigned_functions file contains function assignments for the protein-encoding genes. Each line is of the form
    		Id\tFunction\tConfidence  (\t stands for a tab character)
    
    The Id must be a valid PEG Id. These are of the form:
    		fig|xxxx.y.peg.z
    
    where xxxx.y is the genome Id, and z is an integer that uniquely distinguishes the peg (protein-encoding gene).
    Confidence is a single-character code. The second tab and the confidence code can be omitted (the confidence will default to a space). The assigned_functions file is optional. You can leave it blank and, after adding the genome to the SEED, ask for automated assignments.

  3. The tbl files specify the locations of genes, as well as any aliases. Each line in a tbl file is of the form
    	Id\tLocation\tAliases    (the aliases are separated by tabs)
    
    The Id must conform to the fig|xxxx.y.peg.z format described above. The Location is of the form
    	L1,L2,L3...Ln
    
    where each Li describes a region on a contig and is of the form
    
    	Contig_Begin_End

    where
    
    	      Contig is the Id of the contig,
    	      Begin is the position of the first character, and
    	      End is the position of the last character
    
    For example,
    fig|562.1.peg.15	Escherichia_coli_K12_14168_15295	dnaJ	b0015	sp|P08622	gi|16128009 
    
    describes the dnaJ gene encoded on the positive strand from 14168 through 15295 on the contig Escherichia_coli_K12. The gene is from the genome 562.1, and it has 4 specified aliases.
  4. The fasta files must have gene Ids that match tbl file entries. The peg fasta file contains translations, while the rna fasta file contains DNA sequences.
  5. Both the peg and the rna subdirectories are optional.
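As a concrete illustration, here is a minimal sketch of building such a directory by hand for a hypothetical genome 562.4; all of the file contents, paths, and coordinates below are made up for illustration only:
	mkdir -p ~fig/Tmp/562.4/Features/peg
	cd ~fig/Tmp/562.4
	echo "Example E. coli sequencing project (illustrative)" > PROJECT
	echo "Escherichia coli example strain" > GENOME
	echo "Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales; Enterobacteriaceae; Escherichia; Escherichia coli" > TAXONOMY
	cp /path/to/contigs.fasta CONTIGS                          # contigs in fasta format
	cp /path/to/protein_translations.fasta Features/peg/fasta
	# One tab-separated line per protein-encoding gene (location and aliases are fabricated):
	printf 'fig|562.4.peg.1\tContig1_100_1400\tdnaA\n' > Features/peg/tbl
	# Optional tab-separated function assignments:
	printf 'fig|562.4.peg.1\tChromosomal replication initiator protein DnaA\n' > assigned_functions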

The SEED provides a utility that can be used to produce such a directory from a GenBank entry. Thus,
	parse_genbank 562.4 ~/Tmp/562.4 < genbank.entry.for.a.new.E.coli.genome
would attempt to produce a properly formatted directory (~/Tmp/562.4) containing the data encoded in the GenBank entry from the file genbank.entry.for.a.new.E.coli.genome. This script is far from perfect, and there is huge variance in encodings in GenBank files. So, use it at your own risk (and manually check the output).
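Since the output needs a manual check, a few ordinary shell commands (paths as in the example above) give a quick first look at what was produced:
	ls -R ~/Tmp/562.4                               # overall layout
	head -5 ~/Tmp/562.4/Features/peg/tbl            # spot-check a few gene locations and aliases
	grep -c '^>' ~/Tmp/562.4/Features/peg/fasta     # count the protein translations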

You would be well advised to look at some of the subdirectories included in the FIGdisk/FIG/Data/Organisms directory to see examples of how it should be done.

So, supposing that you have built a valid directory (say, /Users/fig/Tmp/562.4), you can add the genome using

	fig add_genome /Users/fig/Tmp/562.4

The add_genome request will add your new genome and queue a computational request that similarities be computed for the protein-encoding genes.

Computing Similarities

Adding a genome does not automatically get similarities computed for the new genome; it queues the request. To get the similarities actually computed, you need to establish a computational environment on which the blast runs will be made, and then initiate a request on the machine running the SEED.

This is not a completely trivial process because there are a variety of different ways to compute similarities:

  1. You can just compute them on the system running the SEED. This can take several days, but this is often a perfectly reasonable way to get the job done.
  2. Alternatively, you may be in an environment where you have a set of networked machines (say, 4-5 machines), and you wish to just exploit these machines to do the blast runs.
  3. Finally, you may be dealing with a large genome or genomes (and, hence, the need for many days of computation). In this case, it makes sense to utilize a large computational resource, and this resource may either be a local cluster or a service provided over the net.

To establish the flexibility needed to support all of these alternatives, we implemented the following approach: we anticipate that at least one major center (Argonne National Lab) and, perhaps, more will create well-defined interfaces for handling high-volume requests, and at FIG we will maintain a set of instructions on how to set up your configuration to exploit these resources.

Deleting Genomes from a Version of the SEED

There are two common instances in which one wishes to delete genomes from a running version of the SEED: one is when you wish to replace an existing version of a genome (in which case the replacement is viewed as first deleting the existing copy and then adding the new copy), and the second is when you wish to create a copy of the SEED containing a subset of the entire collection of genomes.

To delete a set of genomes from a running version of the SEED, just use

	fig delete_genomes G1 G2 ...Gn  (where G1 G2 ... Gn designates a list of genomes)
For example,
	fig delete_genomes 562.1
could be used to delete a single genome with a genome ID of 562.1.

Making a copy with some genomes deleted to give to someone else requires a slightly different approach. To extract a set of genomes from an existing version of the SEED, you need to run the command

	extract_genomes Which ExistingData ExtractedData
The first argument is either the word "unrestricted" or the name of a file containing a list of genome IDs (the genomes that are to be retained in the extraction). The second argument is the path to the current Data directory. The third argument specifies the name of a directory that is created holding the extraction. Thus,
	extract_genomes unrestricted /Users/fig/FIGdisk/FIG/Data /Volumes/Tmp/ExtractedData
would create the extracted Data directory for you. If you wish to then produce a fully distributable version of the SEED from the existing version and the extracted Data directory, you would use
	make_a_SEED /Users/fig/FIGdisk /Volumes/Tmp/ExtractedData /Volumes/MyFriend/FIGdisk.ReadyToGo
	rm -rf /Volumes/Tmp/ExtractedData
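If, instead, you want to retain only a specific set of genomes, the first argument can name a file of genome IDs; a small sketch (the file location and the IDs listed are purely illustrative):
	printf '562.1\n562.4\n' > /tmp/genomes.to.keep
	extract_genomes /tmp/genomes.to.keep /Users/fig/FIGdisk/FIG/Data /Volumes/Tmp/ExtractedData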

Periodic Reintegration of Similarities

When the initial SEED was constructed, similarities were computed. For most similarities of the form "Id1 and Id2 are similar", entries were "recorded" for both Id1 and Id2. This is not always true, since we truncate the number of similarities associated with any single Id (leaving us in a situation in which we may have similarity recorded for Id1, but not Id2). When a genome is added, if Id1 was an added protein-encoding gene (peg), then the similarity is "recorded" for Id1 but not Id2. This means that when looking at genes from previously existing organisms, you never get links back to the added pegs. This is not totally satisfactory.

Periodically, it is probably a good idea to "reintegrate the similarities". This can be done by just running

        reintegrate_sims
#	update_sims /dev/null /dev/null ~/FIGdisk/FIG/Data/NewSims/* ; rm -f ~/FIGdisk/FIG/Data/NewSims/* ; index_sims
The job will probably run for quite a while (perhaps as much as a day or two).
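If you would like this to happen on a regular schedule, a cron entry along the following lines could work; the schedule shown and the way the environment is set up are only a suggestion, and assume the fig user's crontab:
	# Run at 02:00 on the first day of each month (illustrative schedule)
	0 2 1 * * bash -c 'cd ~/FIGdisk && source config/fig-user-env.sh && reintegrate_sims'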

Computing "Pins" and "Clusters"

The SEED displays potentially significant clusters on prokaryotic chromosomes. In the process of finding preserved contiguity, it computes "pins", which are simply a set of genes that are believed to be orthologs that cluster with similar genes. If you add your own genome, you will probably want to compute and enter these into the active database. This can be done using
	compute_pins_and_clusters G1 G2 G3 ...
where the arguments are genome Ids. Thus,
	compute_pins_and_clusters 562.4
would compute and add entries for all of the pegs in genome 562.4.