SEED Administration

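This tutorial covers what you need to know in order to install, share, and maintain your SEED installation. It is organized into the following sections: backing up your data; copying a version of the SEED; running multiple copies of the SEED; adding a new genome to an existing SEED; importing external protein data; computing similarities; deleting genomes from a version of the SEED; periodic reintegration of similarities; computing "pins" and "clusters"; and automatic annotation of genomes.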

Backing Up Your Data

The data and code stored within the SEED are organized as follows:
	~fig				     on a Mac: /Users/fig; on Linux: /home/fig
		FIGdisk
			dist                 source code
			FIG
				Tmp          temporary files
				Data         data in readable form
  1. The directory FIGdisk holds both the code and data for the SEED. The data is loaded into a database system that stores the data in a location external to FIGdisk, but otherwise a running SEED is encapsulated within FIGdisk. A symbolic link to FIGdisk is maintained in the directory ~fig.
  2. Within FIGdisk there are two key directories:

    1. dist contains the source code, and
    2. FIG contains the execution environment and Data.

  3. Within FIG, there are a number of directories. The most important are

    1. Data, which contains all of the data in a human-readable form, and

    2. Tmp, which contains the temporary files built by SEED in response to commands.

Hence, to back up your data, you should simply copy the Data directory to a separate disk. Suppose that /Volumes/Backup is a backup disk. Then,
	cp -pRP ~/FIGdisk/FIG/Data /Volumes/Backup/Data.Backup
	gzip -r /Volumes/Backup/Data.Backup

would be a reasonable way to make a backup. The copy preserves permissions, copies recursively, and does not follow symbolic links.
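
The two commands above can be wrapped in a small helper so that each backup lands in its own dated directory. This is a minimal sketch and not part of the SEED: the function name backup_seed_data and the Data.Backup.YYYY-MM-DD naming are assumptions made for illustration.

```shell
# backup_seed_data DATA_DIR BACKUP_ROOT
# Hypothetical helper: copy DATA_DIR to BACKUP_ROOT/Data.Backup.<date>,
# compress the copy, and print the name of the backup directory.
backup_seed_data() (
    data_dir=$1
    backup_root=$2
    dest="$backup_root/Data.Backup.$(date +%Y-%m-%d)"

    if [ ! -d "$data_dir" ]; then
        echo "no such directory: $data_dir" >&2
        return 1
    fi
    if [ -e "$dest" ]; then
        echo "backup already exists: $dest" >&2
        return 1
    fi

    # -p preserves permissions, -R recurses, -P does not follow symbolic links
    cp -pRP "$data_dir" "$dest" || return 1

    # Compress every regular file inside the copy, in place.
    gzip -r "$dest"
    case $? in
        0|2) ;;             # 2 is only a warning, e.g. a skipped symbolic link
        *) return 1 ;;
    esac
    echo "$dest"
)

# Example:
#   backup_seed_data ~/FIGdisk/FIG/Data /Volumes/Backup
```

Refusing to overwrite an existing backup keeps a second run on the same day from mixing compressed and uncompressed files in one directory.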

Copying a Version of the SEED

To make a second copy of the SEED (either for a friend or for yourself), you should use tar to preserve a few symbolic links (which are relative, not absolute; this means that they can be copied while still preserving the integrity of the whole system). So, suppose that you have a FIGdisk in /Volumes/From/FIGdisk.Jan8 and you wish to copy it to /Volumes/To. Use
   cd /Volumes/From
   tar cf - FIGdisk.Jan8 | (cd /Volumes/To; tar xf -)

This should produce the desired copy. In this case, suppose that we are in a Mac OS X environment, and From and To are FireWire disks. To install the system on a friend's Mac, you would unmount To, plug it into the new machine, and then set the symbolic link to the active FIGdisk using

	cd ~fig
	rm FIGdisk                          # fails if there is no existing FIGdisk on the machine
	ln -s /Volumes/To/FIGdisk.Jan8 FIGdisk
	bash                                # switch to using the bash shell
	cd FIGdisk
	cp CURRENT_RELEASE DEFAULT_RELEASE  # causes the new configuration to use the code that was running in the original installation
	./configure arch-name               # configure the new SEED disk for architecture arch-name
	source config/fig-user-env.sh       # set up the environment for using the SEED
	start-servers                       # start the database server and registration servers
	init_FIG                            # initialize a new relational database
	fig load_all                        # load the database from the SEED data files; this may take several hours

At this point, the new SEED copy should be ready to use. You only need to perform the configure, init_FIG, and fig load_all steps once after installing a new copy of the SEED. After a reboot or other clean start of the computer, you will only have to do these steps:

	cd ~fig/FIGdisk
	bash                                # switch to using the bash shell
	source config/fig-user-env.sh       # set up the environment for using the SEED
	start-servers                       # start the database server and registration servers
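
Before sourcing the environment script after a reboot, it can help to confirm that the FIGdisk link still resolves, for example when the SEED lives on an external disk that may not be mounted yet. The check below is a sketch under that assumption; check_figdisk is a hypothetical name, not a SEED command.

```shell
# check_figdisk FIG_HOME
# Hypothetical pre-flight check: verify that FIG_HOME/FIGdisk is a symbolic
# link to a directory containing config/fig-user-env.sh, and print the path
# of that script.
check_figdisk() (
    fig_home=$1
    link="$fig_home/FIGdisk"

    if [ ! -L "$link" ]; then
        echo "$link is not a symbolic link" >&2
        return 1
    fi
    if [ ! -d "$link" ]; then
        echo "$link does not resolve to a directory (is the disk mounted?)" >&2
        return 1
    fi
    if [ ! -f "$link/config/fig-user-env.sh" ]; then
        echo "$link/config/fig-user-env.sh is missing" >&2
        return 1
    fi
    echo "$link/config/fig-user-env.sh"
)

# Example (run as the fig user, from bash):
#   env_script=$(check_figdisk ~fig) && source "$env_script" && start-servers
```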

Upon setting up a new computer for running SEED, you should read the full documentation for SEED installation, as it has a number of platform-specific modifications that need to be performed. This document can currently be found at the following location in the SEED Wiki:

http://www-unix.mcs.anl.gov/SEEDWiki/moin.cgi/SeedInstallationInstructions

Running Multiple Copies of the SEED

For individual users that use the SEED to support comparative analysis, a single copy is completely adequate. Adding genomes can usually be done without disrupting normal use, and a very occasional major reorganization that runs over the weekend is not a big deal.

The situation is somewhat different when the system is being used to support a major sequencing/annotation effort. In this case, you have a user community that is sensitive to disruptions of service, and you have frequent demands to update versions of data. It is then best to have two systems: the production system is used to support the larger user community, and the update system is used to prepare updated versions of the system. New genomes are added to the update system, and then periodically a revised Data directory is extracted to update the production system. Even so, work stoppages of a few hours will occur when new releases are swapped in.

This use of an "update" and a "production" system is quite analogous to running a production system which is occasionally updated from new Data DVDs (which FIG normally makes available about every 4-6 months). That is, in both cases you are updating a production system from a newly created Data directory that is lacking assignments and annotations that exist on your production system. However, if you have added new genomes to the production system (that are not part of the releases you may acquire via DVDs), you should get the new release, install the versions of your local genomes, and then do this update procedure.

The plan we propose is to build a completely encapsulated new version of the system, then capture updates from the old production system, update the new production system, and then make the new version the actual production system. This last step amounts to altering a symbolic link to point at the new production system rather than the old. This has the virtue of ease of recovery -- that is, if something goes wrong you can flip back to the old system. The actual steps are as follows:

  1. First, make sure that you are in the bash shell by typing "echo $SHELL"; if the result does not end in "bash" (for example, /bin/bash), type "bash" to enter the bash shell.

  2. Next, check that the result of typing "which perl" is the version of perl owned by the SEED; it should look something like
           /Users/fig/FIGdisk/env/mac/bin/perl
       
    although the exact results will depend on where your existing copy of the SEED is installed, whether your platform is a Macintosh or Linux, etc. If the result does not look similar to the above, type:
           source Path_to_FIGdisk/config/fig-user-env.sh
       
    to set up your FIG environment properly.

  3. Next, make a copy of the Code Distribution Environment (from a DVD or via the network). Suppose that we have made such a directory in CodeDistEnv. Then use
    	cd CodeDistEnv
    	./install-code TargetDirectory
    
    where TargetDirectory is where you wish to build the new production version. We recommend calling it something like FIGdisk.July24.

  4. Stop all work on the production machine for the duration of the update. You do this by clicking on the "Seed Control Panel" link, and then entering an explanatory message in the text box and clicking on the "Disable SEED server" button.

  5. You now need to capture the assignments, annotations and subsystems work that has been done on the production machine. To do this, you need to know when the last production release was installed. Suppose that it was July 1, 2004. If that was the date, we recommend that you run
            extract_data_for_syncing_after_update 7/1/2004 /tmp/sync.data.july.1.2004
         
    This will capture your updates and save them in the directory /tmp/sync.data.july.1.2004.

  6. Now, you need to stop the existing production system using
    	~/FIGdisk/bin/stop-servers
    

  7. Now, you need to configure the runtime environment for the system you are running on. To do this, use
    	cd TargetDirectory
    	./configure MacOrLinux
    
    where MacOrLinux must be a currently supported environment. As of July 24, 2004, the supported environments are mac for Macintoshes running Panther, mac-jaguar for those that have not upgraded to Panther, and linux-postgres.

  8. Now, you need to insert the new Data directory into the newly constructed version of the SEED. To do this use
    	chmod -R 777 TheNewData
    	cd TargetDirectory/FIG
    	ln -s TheNewData Data
    
    where TheNewData is the new Data directory, which normally comes from the update system. Give TheNewData as an absolute path, since a relative link target would be resolved relative to TargetDirectory/FIG. If you acquired a new Data directory via Data DVDs, you will need to unpack them using the README instructions; the result is a new version of the Data directory.

  9. Now, you need to start the servers in order to load the databases with the new release using
    	cd TargetDirectory/bin
    	./start-servers
    	cd ..
    	source config/fig-user-env.sh
    	init_FIG
    	fig load_all
    
    This last command will run for several hours.

    (WARNING: Please note that, because the new SEED's databases do not yet exist, the `init_FIG` command will generate two totally harmless but rather terrifying error messages the very first time it is executed, so that its output will look something like this:

    DBI connect('dbname=fig;port=10000','fig',...) failed: FATAL:  Database "fig" does not exist in the system catalog. at /home2/FIGdisk.July22/dist/releases/snap-2004-0723/linux-postgres/lib/FigKernelPackages/DBrtns.pm line 21
    
    Initializing new SEED database fig
    
    ERROR:  DROP DATABASE: database "fig" does not exist
    dropdb: database removal failed
    CREATE DATABASE
    NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index 'file_table_pkey' for table 'file_table'
    CREATE TABLE
    
    Complete. You will need to run "fig load_all" to load the data.
    
    We recognize that generating the above two faux "FATAL" errors constitutes a rather ugly and inelegant implementation, but we have not yet found a more elegant database initialization method that can avoid generating them.)

  10. Now, you need to capture the changes made to the old production version using something like
             sync_new_system /tmp/sync.data.july.1.2004 make-assignments
         

  11. Run
    	index_annotations
    	index_subsystems
    	make_indexes
        

  12. Now, finally, you should alter the symbolic link in ~fig to the current FIGdisk using something like:
    	cd ~fig
    	rm FIGdisk     # should be removing a symbolic link to the current SEED
    	ln -s TargetDirectory FIGdisk
    
    That should make the new SEED the one available through the Web interface.

  13. You should now bring your update system to the same state as the production system. This can be done by making sure that /tmp/sync.data.july.1.2004 is accessible to the update system. If the production and update systems are run on the same machine, then the directory is already there. If not, copy it to /tmp on the update machine. Then run
    	 sync_new_system /tmp/sync.data.july.1.2004 make-assignments
         

    on the update machine.
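
The link switch in step 12, and the "flip back to the old system" recovery mentioned earlier, can be sketched as a pair of shell functions. These are illustrations only: the names switch_figdisk and rollback_figdisk and the FIGdisk.previous bookkeeping file are assumptions, not part of the SEED.

```shell
# switch_figdisk FIG_HOME NEW_TARGET
# Hypothetical helper: repoint FIG_HOME/FIGdisk at NEW_TARGET and record the
# old target in FIG_HOME/FIGdisk.previous so the change can be undone.
switch_figdisk() (
    fig_home=$1
    new_target=$2
    link="$fig_home/FIGdisk"

    case $new_target in
        /*) ;;
        *)  echo "use an absolute path: $new_target" >&2; return 1 ;;
    esac
    if [ ! -d "$new_target" ]; then
        echo "no such directory: $new_target" >&2
        return 1
    fi

    if [ -L "$link" ]; then
        readlink "$link" > "$fig_home/FIGdisk.previous" || return 1
        rm "$link" || return 1
    elif [ -e "$link" ]; then
        echo "$link is not a symbolic link; refusing to remove it" >&2
        return 1
    fi
    ln -s "$new_target" "$link"
)

# rollback_figdisk FIG_HOME
# Restore the link target recorded by the last switch_figdisk call.
rollback_figdisk() (
    fig_home=$1
    if [ ! -s "$fig_home/FIGdisk.previous" ]; then
        echo "no previous target recorded" >&2
        return 1
    fi
    old=$(cat "$fig_home/FIGdisk.previous")
    rm -f "$fig_home/FIGdisk" && ln -s "$old" "$fig_home/FIGdisk"
)

# Example:
#   switch_figdisk ~fig /Volumes/Disk/FIGdisk.July24
#   rollback_figdisk ~fig        # if something goes wrong
```

Insisting on an absolute target avoids a link that silently points somewhere relative to ~fig rather than to the directory that was intended.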

Our experience is that anytime a group wishes to share a common production environment, this 2-system approach is the way to do it. You can, if necessary, put both systems on the same physical machine. This does require some special handling in setting up two different FIGdisk directories. We recommend using FIGdisk.production and FIGdisk.update. However, in general it makes sense to use two separate physical machines, for backup if nothing else. The update system can usually be run on a $2000 (or less) box, although it is desirable to spend a little more and get at least 1 gigabyte of main memory and 200 gigabytes of external disk.

Adding a New Genome to an Existing SEED

To add a new genome to a running SEED is fairly easy, but there are a number of details that do have to be handled with care.

The first thing to note is that the SEED does not include tools to call genes -- you are expected to provide gene calls. This may change at some point, but for now you must call your own genes. A number of good tools now exist in the public domain, and you will need to find one that seems adequate for your needs.

Let us now cover how to prepare the actual data. You need to construct a directory (somewhere like ~fig/Tmp) of the following form:
	GenomeId                     of the form xxxx.y where xxxx is the taxon ID and y is an integer
		PROJECT              a file containing a description of the source of the data
		GENOME               a file containing a single line identifying the genus, species and strain
		TAXONOMY             a file containing a single line containing the NCBI taxonomy
		RESTRICTIONS         a file containing a description of distribution restrictions (optional)
		CONTIGS              contigs in fasta format
		assigned_functions   function assignments for the protein-encoding genes (optional)
		Features
			peg
				tbl      describes locations and aliases for the protein-encoding genes
				fasta    fasta file of translations of the protein-encoding genes
			rna
				tbl      describes locations and aliases for the rna-encoding genes
				fasta    fasta file of the DNA corresponding to the genes


Let us expand on this very brief description:

  1. The name of the directory must be of the form xxxx.y where xxxx is the taxon ID, and y is a sequence number. For example, 562.1 might be used for E.coli, since 562 is the NCBI taxon ID for Escherichia coli. The sequence number (y) is used to distinguish multiple genomes having the same taxon ID.

  2. The assigned_functions file contains assignments of function for the protein-encoding genes. Each line is of the form
    		Id\tFunction\tConfidence  (\t stands for a tab character)
    
    The Id must be a valid PEG Id. These are of the form:
    		fig|xxxx.y.peg.z
    
    where xxxx.y is the genome Id, and z is an integer that uniquely distinguishes the peg (protein-encoding gene).
    Confidence is a single-character code. The second tab and the confidence code can be omitted (the code will then default to a space). The assigned_functions file is optional. You can leave it blank and, after adding the genome to the SEED, ask for automated assignments.

  3. The tbl files specify the locations of genes, as well as any aliases. Each line in a tbl file is of the form
    	Id\tLocation\tAliases    (the aliases are separated by tabs)
    
    The Id must conform to the fig|xxxx.y.peg.z format described above. The Location is of the form
    	L1,L2,L3...Ln
    
    where each Li describes a region on a contig and is of the form
    
    	Contig_Begin_End where
    
    	      Contig is the Id of the contig,
    	      Begin is the position of the first character, and
    	      End is the position of the last character
    
    For example,
    fig|562.1.peg.15	Escherichia_coli_K12_14168_15295	dnaJ	b0015	sp|P08622	gi|16128009 
    
    describes the dnaJ gene encoded on the positive strand from 14168 through 15295 on the contig Escherichia_coli_K12. The gene is from the genome 562.1, and it has 4 specified aliases.
  4. The fasta files must have gene Ids that match tbl file entries. The peg fasta file contains translations, while the rna fasta file contains DNA sequences.
  5. Both the peg and the rna subdirectories are optional.
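
Because a contig Id may itself contain underscores (as in the dnaJ example above), a Location is easiest to split from the right: the last two underscore-separated fields are Begin and End, and everything before them is the contig. The sketch below does this with awk. The function name parse_tbl_locations is hypothetical, and reporting "-" when Begin is greater than End is an assumption about how the reverse strand is encoded; the text above only states the positive-strand case.

```shell
# parse_tbl_locations < tbl
# Hypothetical helper: for every region of every tbl line, print
# Id, contig, Begin, End, and strand, separated by tabs.
parse_tbl_locations() {
    awk -F '\t' '
    {
        n = split($2, regions, ",")
        for (i = 1; i <= n; i++) {
            m = split(regions[i], parts, "_")
            if (m < 3 || parts[m-1] !~ /^[0-9]+$/ || parts[m] !~ /^[0-9]+$/) {
                printf "line %d: bad location %s\n", NR, regions[i] > "/dev/stderr"
                bad = 1
                continue
            }
            # Rejoin everything before Begin and End into the contig Id.
            contig = parts[1]
            for (j = 2; j <= m - 2; j++) contig = contig "_" parts[j]
            start = parts[m-1] + 0
            stop = parts[m] + 0
            strand = (start <= stop) ? "+" : "-"   # assumption, see above
            printf "%s\t%s\t%d\t%d\t%s\n", $1, contig, start, stop, strand
        }
    }
    END { exit (bad ? 1 : 0) }'
}

# Example:
#   parse_tbl_locations < ~/Tmp/562.4/Features/peg/tbl
```

The function exits with a non-zero status if any location could not be parsed, which makes it usable as a quick sanity check on a hand-built tbl file.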

The SEED provides a utility that can be used to produce such a directory from a GenBank entry. Thus,
	parse_genbank 562.4 ~/Tmp/562.4 < genbank.entry.for.a.new.E.coli.genome
would attempt to produce a properly formatted directory (~/Tmp/562.4) containing the data encoded in the GenBank entry from the file genbank.entry.for.a.new.E.coli.genome. This script is far from perfect, and there is huge variance in encodings in GenBank files. So, use it at your own risk (and manually check the output).

You would be well advised to look at some of the subdirectories included in the FIGdisk/FIG/Data/Organisms directory to see examples of how it should be done.
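
A quick structural check before running add_genome can catch a missing file early. The following is a sketch only: check_genome_dir is a hypothetical name, and it treats PROJECT, GENOME, TAXONOMY, and CONTIGS as required simply because the layout above does not mark them optional. It also assumes that a peg or rna subdirectory, when present, should contain both tbl and fasta.

```shell
# check_genome_dir DIR
# Hypothetical helper: report obvious structural problems in a genome
# directory and return non-zero if any were found.
check_genome_dir() (
    dir=$1
    status=0

    # The directory name must look like xxxx.y (taxon ID, dot, sequence number).
    if ! printf '%s\n' "$(basename "$dir")" | grep -Eq '^[0-9]+\.[0-9]+$'; then
        echo "directory name is not of the form xxxx.y: $dir" >&2
        status=1
    fi

    for f in PROJECT GENOME TAXONOMY CONTIGS; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty file: $f" >&2
            status=1
        fi
    done

    # GENOME and TAXONOMY are described as single-line files.
    for f in GENOME TAXONOMY; do
        if [ -s "$dir/$f" ] && [ "$(awk 'END { print NR }' "$dir/$f")" -ne 1 ]; then
            echo "$f should contain a single line" >&2
            status=1
        fi
    done

    # peg and rna are optional, but each one present needs tbl and fasta.
    for kind in peg rna; do
        if [ -d "$dir/Features/$kind" ]; then
            for f in tbl fasta; do
                if [ ! -f "$dir/Features/$kind/$f" ]; then
                    echo "missing file: Features/$kind/$f" >&2
                    status=1
                fi
            done
        fi
    done
    return $status
)

# Example:
#   check_genome_dir /Users/fig/Tmp/562.4 && fig add_genome /Users/fig/Tmp/562.4
```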

So, supposing that you have built a valid directory (say, /Users/fig/Tmp/562.4), you can add the genome using

	fig add_genome /Users/fig/Tmp/562.4

The add_genome request will add your new genome and queue a computational request that similarities be computed for the protein-encoding genes.

Importing External Protein Data

The presence of external judgements about the possible functions of encoded proteins is one of the essential aspects of the SEED. It becomes important that one be able to add new sources of annotation, as well as periodically updating the judgements of existing sources. To update the external sets of proteins and annotations, build a new nonredundant database of proteins, and compute the associated similarities, one should proceed as follows:
  1. Stop using the system until this procedure completes.

  2. Update the NR Directory

    The NR directory is located within the Data directory:
    	~fig				          on a Mac: /Users/fig; on Linux: /home/fig
    		FIGdisk
    			dist                      source code
    			FIG
    				Tmp               temporary files
    				Data              data in readable form
    					NR        external data
    
    The NR directory contains one subdirectory for each source of external assignments (the released SEED includes subdirectories for SwissProt, NCBI, UniProt, and KEGG). You may add more subdirectories.

    Each subdirectory must include 3 files:

    1. fasta should be a fasta file containing the protein sequences. These sequences will be used to establish a correspondence between these IDs and other protein sequences within the SEED.

    2. org.table is a two-column, tab-separated table. Column 1 is the ID, and column 2 is the organism corresponding to the ID.

    3. assign_functions is a 2-column table. The ID is in column 1, and column 2 contains the gene function (often called a product name) asserted by the external source.

    You should proceed only when you have updated as many of the sources as you wish.

  3. Now run
           import_external_sequences_step1
    
    This program will build a new nonredundant database, check to see what has changed, and will build the input required to compute new similarities.

  4. Compute the needed similarities. You will need three files to compute a new batch of similarities. The locations of these three files are displayed by import_external_sequences_step1 just before completion (i.e., you should have gotten them as the output of the last step). Compute the similarities (see the discussion below) and store them in the NewSims directory (again, the precise location was displayed by import_external_sequences_step1).

  5. Run
           import_external_sequences_step3
    

Computing Similarities

Adding a genome does not automatically get similarities computed for the new genome. To get the similarities actually computed, you need to compute them and make them available in the FIGdisk/FIG/Data/NewSims directory.

To compute similarities, you will need to do the following:

  1. The translations of the set of PEGs in your new genome (e.g., genome 562.4) should be in ~fig/FIGdisk/FIG/Data/Organisms/562.4/Features/peg/fasta. A copy of this was appended to ~fig/FIGdisk/FIG/Data/Global/nr when your genome was added. nr is the "nonredundant database" we use to compute similarities (and the one you must use). To get the initial blast results, you would use something like
              blastall -i ~fig/FIGdisk/FIG/Data/Organisms/562.4/Features/peg/fasta -d ~fig/FIGdisk/FIG/Data/Global/nr -m 8 -FF -p blastp | reduce_sims ~fig/FIGdisk/FIG/Data/Global/peg.synonyms 300 > reduced.sims
    

    which produces the blast results in a tab-separated format. The invocation of reduce_sims is optional. It has the effect of limiting the retained similarities for each PEG to 300, with a truncation approach that attempts to preserve at least one similarity against each other genome (i.e., the trimming is selective).
  2. The output of blastall lacks 2 columns that we need -- columns containing the length of each of the similar sequences. To add that, you would use
            reformat_sims ~fig/FIGdisk/FIG/Data/Global/nr < reduced.sims > ~fig/FIGdisk/FIG/Data/NewSims/sims.for.562.4
    

    This will actually append two columns to each similarity and place the results in the NewSims directory where it should be.
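
Before indexing, it is cheap to confirm that a similarity file has the expected shape. The -m 8 output of blastall has 12 tab-separated columns, so a file that has been through reformat_sims should have 14, assuming reduce_sims leaves the column count unchanged. The check below is a sketch; check_sims_columns is a hypothetical name, not a SEED command.

```shell
# check_sims_columns FILE [COLUMNS]
# Hypothetical helper: verify that every line of FILE has COLUMNS
# tab-separated fields (default 14: 12 from blastall -m 8 plus the
# 2 length columns appended by reformat_sims).
check_sims_columns() {
    awk -F '\t' -v want="${2:-14}" '
        NF != want {
            printf "line %d: %d columns, expected %d\n", NR, NF, want > "/dev/stderr"
            bad = 1
        }
        END { exit (bad ? 1 : 0) }' "$1"
}

# Example:
#   check_sims_columns reduced.sims 12
#   check_sims_columns ~fig/FIGdisk/FIG/Data/NewSims/sims.for.562.4
```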

The above description will produce similarities using a single invocation of blastall. For most large genomes, and whenever you wish to process a batch of genomes, you should use parallel processing while maintaining the spirit of the approach. No matter how you produce the new similarities, they need to be added as a file in the FIGdisk/FIG/Data/NewSims directory. Then, you need to index these similarities using

	index_sims ~/FIGdisk/FIG/Data/NewSims/XXXX
where XXXX is the file you added. If you have more than one such file, just give several arguments to the command. This will "index" the similarities, so that similarities connecting any of the new PEGs to PEGs from the existing genomes can be displayed. However, the connection from the existing genomes to the new PEGs does not yet exist (we call these the "flips" of the computed sims). To get this ability, you need to go through a process that will make your system unavailable for a period (and it will produce a substantial load on your system for a day or so, while the SEED sorts, sifts, inserts, and generally plays with the "flips").
The extra steps you need to take to make a fully functional version are as follows:
  1. First, you need to run
    	update_sims ~/FIGdisk/FIG/Data/Global/peg.synonyms 300 ~/FIGdisk/FIG/Data/Sims ~/Tmp/FlippedSims ~/FIGdisk/FIG/Data/NewSims/*
    
    This should produce updated similarity files in a VERY BIG directory that we happened to put at ~/Tmp/FlippedSims (but, which you could put anywhere). This may run as much as a day or so (and you can watch its progress as it updates the similarity files).
  2. The next step is to replace the existing similarity files with the newly computed ones. First, make the SEED unavailable (via the SEED Control Panel).
  3. Then, blow away the existing similarities using something like
    	rm ~/FIGdisk/FIG/Data/Sims/*
    	rm ~/FIGdisk/FIG/Data/NewSims/*
    	cp ~/Tmp/FlippedSims/* ~/FIGdisk/FIG/Data/Sims
    	rm -r ~/Tmp/FlippedSims
    
    There are several ways to do this. You might want to save the old similarities somewhere, or you might be able to move (rather than copy) the new similarities into place. Do whatever suits you.
  4. Then run
    	index_sims
    
    to re-index all of the similarities, and you should be fully operational.
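
One cautious variant of step 3 sets the old similarity files aside instead of deleting them, so that the previous state can be restored if the new files turn out to be bad. This is a sketch under that assumption: replace_sims is a hypothetical name, and the backup location is whatever directory you choose (it needs room for a full copy of the old similarities).

```shell
# replace_sims DATA_DIR FLIPPED_DIR BACKUP_DIR
# Hypothetical helper: move the current Sims and NewSims files into
# BACKUP_DIR, then copy the files from FLIPPED_DIR into DATA_DIR/Sims.
replace_sims() (
    data_dir=$1
    flipped=$2
    backup=$3

    if [ ! -d "$data_dir/Sims" ] || [ ! -d "$flipped" ]; then
        echo "expected directories $data_dir/Sims and $flipped" >&2
        return 1
    fi
    if [ -z "$(ls -A "$flipped")" ]; then
        echo "nothing to install: $flipped is empty" >&2
        return 1
    fi
    mkdir -p "$backup/Sims" "$backup/NewSims" || return 1

    # Set the old files aside instead of deleting them.
    for f in "$data_dir"/Sims/* "$data_dir"/NewSims/*; do
        [ -e "$f" ] || continue               # glob matched nothing
        sub=$(basename "$(dirname "$f")")     # Sims or NewSims
        mv "$f" "$backup/$sub/" || return 1
    done

    # Install the merged files; FLIPPED_DIR is left for the caller to remove.
    cp -p "$flipped"/* "$data_dir/Sims/"
)

# Example (with the SEED disabled via the SEED Control Panel):
#   replace_sims ~/FIGdisk/FIG/Data ~/Tmp/FlippedSims /Volumes/Backup/OldSims
```

After the function returns successfully, index_sims still has to be run as described in step 4.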

Deleting Genomes from a Version of the SEED

There are two common instances in which one wishes to delete genomes from a running version of the SEED: one is when you wish to replace an existing version of a genome (in which case the replacement is viewed as first deleting the existing copy and then adding the new copy), and the second is when you wish to create a copy of the SEED containing a subset of the entire collection of genomes.

To delete a set of genomes from a running version of the SEED, just use

	fig mark_deleted_genomes User G1 G2 ...Gn  (where G1 G2 ... Gn designates a list of genomes)
For example,
	fig mark_deleted_genomes RossO 562.1
could be used to delete a single genome with a genome ID of 562.1.

Periodic Reintegration of Similarities

When the initial SEED was constructed, similarities were computed. For most similarities of the form "Id1 and Id2 are similar", entries were "recorded" for both Id1 and Id2. This is not always true, since we truncate the number of similarities associated with any single Id (leaving us in a situation in which we may have similarity recorded for Id1, but not Id2). When a genome is added, if Id1 was an added protein-encoding gene (peg), then the similarity is "recorded" for Id1 but not Id2. This means that when looking at genes from previously existing organisms, you never get links back to the added pegs. This is not totally satisfactory.

Periodically, it is probably a good idea to "reintegrate the similarities". This can be done by just running

        reintegrate_sims
#	update_sims /dev/null /dev/null ~/FIGdisk/FIG/Data/NewSims/* ; rm -f ~/FIGdisk/FIG/Data/NewSims/* index_sims
The job will probably run for quite a while (perhaps as much as a day or two).

Computing "Pins" and "Clusters"

The SEED displays potentially significant clusters on prokaryotic chromosomes. In the process of finding preserved contiguity, it computes "pins", which are simply a set of genes that are believed to be orthologs that cluster with similar genes. If you add your own genome, you will probably want to compute and enter these into the active database. This can be done using
	compute_pins_and_clusters G1 G2 G3 ...
where the arguments are genome Ids. Thus,
	compute_pins_and_clusters 562.4
would compute and add entries for all of the pegs in genome 562.4.

Automatic Annotation of Genomes

The SEED provides a simple but limited capability for automated assignment of protein-encoding gene function based on similarity. Candidate functions are assigned scores based on the combined strengths of all BLASTP similarities to genes carrying that particular assignment, weighted by the provenance and assignment-confidence for each similar gene. The final automated function assignment is then determined from the list of candidate functions and their associated scores. Automated assignment is a four-step process:
  1. Create a list of PEGs to be automatically assigned. If one wishes to make assignments to an entire organism or set of organisms that are already installed in the SEED, the simplest method for creating this list is to type the following command:
        pegs Genome1 Genome2 Genome3 ... > ~/Tmp/peg.list
    

  2. Next, create a list of candidate function-assignments using the following command:
       auto_assign < ~/Tmp/peg.list > ~/Tmp/candidate.funcs
    
    (NOTE: The `auto_assign` command has some additional optional parameters; for example, if one knows that all the PEGs in 'peg.list' are from prokaryotic organisms, one can make use of this additional information by invoking `auto_assign` as follows:
       auto_assign prokaryote < ~/Tmp/peg.list > ~/Tmp/candidate.funcs
    
    Also, if one wishes to use an alternate file of similarity data named 'simfile' instead of the precomputed similarities stored in the SEED, one can instead type:
       auto_assign sims=simfile < ~/Tmp/peg.list > ~/Tmp/candidate.funcs
    
    Finally, `auto_assign` can read a set of alternate parameters from a file, but we recommend that you stick with the default settings, and not exploit this last feature unless you are a qualified SEED wizard.)

  3. Next, create a SEED format assigned-functions file as follows:
        make_calls < ~/Tmp/candidate.funcs > ~/Tmp/assigned_functions
    
    Alternatively, if you wish to suppress the class of "non-informative" function assignments such as "Hypothetical protein," "Unclassified protein," "predicted gene," etc., you may do so using the '-no_hypos' flag:
        make_calls -no_hypos < ~/Tmp/candidate.funcs > ~/Tmp/assigned_functions
    
  4. Finally, install the automated assignments in the SEED using the command
        fig assign_functionF master:automated_assignments  ~/Tmp/assigned_functions
    
It should be once again noted that the SEED's automated assignment algorithm is quite simple and crude, being only slightly better than simply assigning the function of the highest-scoring BLASTP hit; however, it at least provides a "quick and dirty" starting point for making an initial assessment of a genome, which may then be cleaned up and refined by skilled genome annotators.
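
The first three steps above can be chained so that the candidate assignments are produced in one go and can then be reviewed before the final install step. The wrapper below is a sketch: auto_annotate is a hypothetical name, the SEED commands are invoked exactly as shown in the steps, and it assumes that those commands return a non-zero exit status on failure.

```shell
# auto_annotate WORKDIR GENOME...
# Hypothetical wrapper around steps 1 to 3: writes peg.list,
# candidate.funcs, and assigned_functions into WORKDIR and prints the
# path of the assigned_functions file. Step 4 is left to the caller.
auto_annotate() (
    if [ "$#" -lt 2 ]; then
        echo "usage: auto_annotate WORKDIR GENOME..." >&2
        return 1
    fi
    workdir=$1
    shift
    mkdir -p "$workdir" || return 1

    pegs "$@" > "$workdir/peg.list" || return 1
    if [ ! -s "$workdir/peg.list" ]; then
        echo "no PEGs found for: $*" >&2
        return 1
    fi
    auto_assign < "$workdir/peg.list" > "$workdir/candidate.funcs" || return 1
    make_calls < "$workdir/candidate.funcs" > "$workdir/assigned_functions" || return 1
    echo "$workdir/assigned_functions"
)

# Example:
#   funcs=$(auto_annotate ~/Tmp/auto.562.4 562.4)
#   less "$funcs"                                   # review before installing
#   fig assign_functionF master:automated_assignments "$funcs"
```

Keeping the install step outside the wrapper leaves room for a skilled annotator to inspect, and if necessary edit, the assigned_functions file first.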