Accessing SEED Services

Introduction

This document is designed to help a programmer understand how to access the services provided by the SEED. It is not designed for the user who wishes to confine his experience to working with the SEED browser over the Web.

Before we launch into a discussion of the details of how to invoke specific routines, there are some background issues that you should know about. First, let's talk about fig, which is a command-line utility that is used for a number of purposes. Its most basic purpose is to exercise the API and illustrate what data is accessed by each operation. However, you may also find it extremely useful as a quick way to invoke services. To try it out, type

        fig
This should produce a prompt of the form
        ??
You should type
        ?? h<cr>
to get a listing of the commands supported by the utility. At the time this document was written (May, 2004), the output was as follows:
?? h
    DB                              Current DB
    abbrev                          genome_name
    access_sims                     [expand|func|expandfunc]       (for timeing - access 10 distinct sims)
    add_annotation                  FID User [ prompted for annotation; terminate with "." at start of line ]
    add_chromosomal_clusters        File
    add_genome                      GenomeDir
    add_pch_pins                    File
    add_peg_link                    PEG Link (e.g., <a href=http//...>link to somewhere</a>)
    all_compounds               
    all_features                    GenomeID Type
    all_maps                 
    all_protein_families
    all_reactions
    all_roles
    all_sets                        Relation SetName
    assign_function                 PEG User [conf=X] Function
    assign_functionF                User File [no_annotations]
    assignments_made                who date G1 G2 ...  [ none => all ]
    auto_assign                     PEG [Seq]
    auto_assignF                    User FileOfPEGs
    auto_assignG                    User GenomeID1 GenomeID2 ...
    bbhs                            PEG Cutoff
    between                         n1 n2 n3
    blast                           PEG  [against nr]
    blastit                         PEG seq db maxP
    boundaries_of                   Loc
    build_tree_of_complete          min_for_label   
    by_alias                        alias
    by_fig_id                       FID1 FID2
    candidates_for_role             Genome Cutoff Role
    cas                             CID
    cas_to_cid                      cas
    catalyzed_by                    RID
    catalyzes                       role
    cgi_url
    clean_tmp
    close_genes                     FID distance
    comp2react                      CID
    contig_ln                       GenomeID Contig
    coupling_and_evidence           FID Bound SimCutoff CouplingCutoff
    crude_estimate_of_distance      GenomeID1 GenomeID2
    delete_all_peg_links            PEG
    delete_genomes                  G1 G2 G3 ...Gn
    delete_peg_link                 PEG URL
    displayable_reaction            RID
    dna_seq                         GenomeID Loc
    dsims                           FastaFile [MaxN MaxPsc Select]   *** select defaults to raw
    ec_to_maps                      EC
    ec_name                         EC
    expand_ec                       EC
    epoch_to_readable               [gives readable version of current time]
    export_chromosomal_clusters
    export_pch_pins
    export_set                      Relation SetName File
    exportable_subsystem            Subsystem
    external_calls                  PEG1 PEG2 PEG3...
    extract_seq                     ContigsFile loc 
    all_exchangable_subsystems
    best_bbh_candidates             Genome Cutoff Requested Known
    family_function                 Family
    fast_coupling                   FID Bound CouplingCutoff
    feature_aliases                 FID
    feature_annotations             FID
    feature_location                FID 
    file2N                          File
    ftype                           FID 
    function_of                     ID [all] [trans]
    genes_in_region                 GenomeID Contig Beg End 
    genome_of                       PEG
    genomes                         [complete|Pat]
    genome_counts                   [complete]
    genome_version                  GenomeID
    genus_species                   GenomeID
    get_translation                 ID
    get_translations                File [tab]
    h                               [command]  *** h h for help on how to use help ***
    hypo                            Function
    ids_in_family                   Family
    ids_in_set                      WhichSet Relation SetName
    in_cluster_with                 PEG
    in_family                       FID
    in_pch_pin_with                 FID
    in_sets                         Which Relation SetName
    is_archaeal                     GenomeID
    is_bacterial                    GenomeID
    is_eukaryotic                   GenomeID
    is_prokaryotic                  GenomeID
    is_exchangable_subsystem        Subsystem
    is_real_feature                 FID   
    largest_clusters                FileOfRoles user 
    load_all
    map_to_ecs                      map
    map_name                        map
    mapped_prot_ids                 ID
    maps_to_id                      ID 
    max                             n1 n2 n3 ...
    merged_related_annotations      PEG1 PEG2 ... PEGn
    min                             n1 n2 n3 ...
    names_of_compound
    neighborhood_of_role            role        
    org_of                          prot_id
    pegs_of                         GenomeID
    possibly_truncated              FID
    reaction2comp                   RID
    related_by_func_sim             PEG user
    reversible                      RID  
    rnas_of                         GenomeID
    roles_of_function               function
    same_func                       'function1'   'function2'
    search_index                    pattern
    seqs_with_role                  role who
    seqs_with_roles_in_genomes      genomes roles made_by
    sims                            PEG maxN maxP select
    sort_fids_by_taxonomy           FID1 FID2 FID3 ...
    sort_genomes_by_taxonomy        GenomeID1 GenomeID2 GenomeID3 ...
    sz_family                       family
    taxonomic_groups_of_complete    min_for_labels
    taxonomy_of                     GenomeID
    translatable                    prot_id
    translate_function              PEG user 
    translated_function_of          PEG user
    translation_length              ID
    unique_functions                PEGs user  [make the pegs comma separated]
    verify_dir                      dir
    
If you quickly browse through the commands that are supported by fig, you should get a fairly complete overview of the services provided by the API. We are not going to cover the use of fig in detail in this tutorial, but we do encourage you to experiment with it. By the way, you should type x to exit from it.

fig is ususally accessed in interactive mode (invoked with no arguments). However, if you wish to invoke it to process a single command, you can just put the command as a single argument on the command-line. Thus,

  fig 'add_peg_link fig|188.1.peg.96 <a href=http://nature.berkeley.edu/~hongwang/Project/ATP_synthase/>animation</a>'
would add a link to PEG fig|188.1.peg.96 while
  fig 'delete_all_peg_links fig|188.1.peg.96'
would delete all links from the PEG. Since adding links to PEGs is something you might wish to do, the easiest way to do it is to just create a file of commands of the sort shown, make the file executable, and then run it.

If you are a Perl programmer, eventually you will wish to peruse the fig source code to see how it is structured, how to add to it, and how to invoke services via the API.

We have briefly discussed the utility fig. Now it is time to discuss writing your own programs using the API. Basically, the entire API is encapsulated in the Perl module FIG.pm. To illustrate the style of programming we recommend, let's create a simple program to add a link to a PEG (rather than using fig). Adding a link to a PEG will cause the link to be displayed whenever anyone examines the PEG via the SEED web services.. The API defines a set of services that do many things; adding a link to a PEG is just one very minor service, but understanding exactly how one could invoke such a service is what this tutorial is about. To begin, use

        tool_hdr add_link
This creates a file called add_link and makes it executable. The file contains an initial set of Perl commands designed to establish the correct execution environment (where the libraries are, and so forth). You need to add your code to the end of the initialized file. We recommend adding the following lines:
use FIG;
my $fig = new FIG;

$usage = "usage: add_link PEG Link";

(
 ($peg  = shift @ARGV) &&
 ($link = shift @ARGV)
)
    || die $usage;

$fig->add_peg_links([$peg,$link]);

A short program to add a link to a PEG


We urge you to study this short program carefully, since it reflects the style we will use throughout this tutorial. There are a number of things worth noting: We suggest that you type in this program and get it to run on your version of the SEED. Then, use fig to delete any links you added.

Accessing Data Relating to Genomes

Let us now begin a somewhat more systematic exploration of the services offered by the API. We will begin with genome-related services, of which there are relatively few in the SEED. Essentially, genomes are represented by IDs of the form xxx.y where xxx is the NCBI taxon ID, and y is a suffix to disambiguate situations in which you have multiple genomes with the same taxon ID.

genomes.ex.pl genomes.ex.py

use strict;
use FIG;
my $fig = new FIG;

my(%sofar,$genome,$which,$abbrev,$version);

foreach $genome ($fig->genomes("complete"))
{
    $which   = $fig->genus_species($genome);
    $abbrev  = &get_unique_abbreviation($which,\%sofar);
    $version = $fig->genome_version($genome);
    print "$genome\t$abbrev\t$version\t$which\n";
}

sub get_unique_abbreviation {
    my($which,$sofar) = @_;
    my($nonunique,$n);

    $nonunique = &FIG::abbrev($which);
    $nonunique =~ s/ //g;
    $nonunique =~ s/\.+$//;
    $n = $sofar->{$nonunique};
    if (! defined($n))
    {
        $n = $sofar->{$nonunique} = 1;
    }
    $sofar->{$nonunique}++;
    return $nonunique . ".$n";
}
from FigKernelPackages import FIG
import sys
import string


def die(msg):
	print msg
	sys.exit(0)

fig = FIG.FIG()
sofar = {}

def get_unique_abbreviation(which, sofar):
    nonunique = fig.abbrev(which);
    nonunique = string.rstrip(string.replace(nonunique," ",  ""), ".")
    if not sofar.has_key(nonunique):
	sofar[nonunique] = 1
    n = sofar[nonunique]
    sofar[nonunique] = sofar[nonunique] + 1
    return (nonunique+"."+str(n))

for genome in fig.genomes("complete"):
    which   = fig.genus_species(genome)
    abbrev  = get_unique_abbreviation(which,sofar);
    version = fig.genome_version(genome);
    print "%s\t%s\t%s\t%s" % (genome, abbrev, version, which)

An illustration of genome-related services


There are a few points worth noting about the short example: Here are just a few lines from the output of this little program:
197221.1        The.elon.BP.1   197221.1.3406724106     Thermosynechococcus elongatus BP-1
198094.1        Bac.anth.st.2   198094.1.2593352402     Bacillus anthracis str. Ames
198214.1        Shi.flex.2a.1   198214.1.1464268764     Shigella flexneri 2a str. 301
198215.1        Shi.flex.2a.2   198215.1.1756034760     Shigella flexneri 2a str. 2457T
What other genome-related services does the SEED offer? The only ones worth mentioning here relate to contigs. Here is a short example illustrating how to use those services. Basically, you can
  1. get a list of the contigs in a given genome,
  2. get the length of a specified contig, and
  3. extract sequence from a contig.
To illustrate these features, here is a short program that list all of the contigs for a designated genome. For each contig that is over 50 characters in size (and, they all should be), it displays the contig ID, the contig length, and the first 50 characters of sequence.

contigs_for_genome.pl contigs_for_genome.py

use strict;
use FIG;
my $fig = new FIG;

my $usage = "usage: contigs_for_genome Genome";

my($genome,@contigs,$contig,$len,$seq);

($genome  = shift @ARGV) 
    || die $usage;

@contigs = $fig->all_contigs($genome);
foreach $contig (@contigs)
{
    $len = $fig->contig_ln($genome,$contig);
    if ($len >= 50)
    {
        $seq = $fig->dna_seq($genome,$contig . "_1_50");
        print "$contig\t$len\t$seq\n";
    }
}

from FigKernelPackages import FIG
import sys

fig = FIG.FIG()

def die(msg):
	print msg
	sys.exit(0)

if len(sys.argv) != 2:
	die("usage: contigs_for_genome Genome")

genome = sys.argv[1]

contigs =  fig.all_contigs(genome)

for contig in contigs:
    len = fig.contig_ln(genome,contig)
    if len >= 50:
        seq = fig.dna_seq(genome,contig+"_1_50")
        print contig+"\t"+len+"\t"+seq

How to access services related to contigs


The services being used in this short example are as follows: This last service (dna_seq) requires some explanation. You need to know that
  1. Locations on a contig begin with 1 (not 0 as some computer scientists would prefer).
  2. Within the SEED, locations are represented as a comma-separated list of contiguous regions. A contiguous region is represented by a contig ID, a beginning position, and an ending position all separated by underscores. One fact that can be a little confusing is that contig IDs themselves may contain underscores. Thus,
            NC_000913_1_50
    
    would be used to represent the first fifty chracters of the contig NC_000913.
  3. Sections of sequence on the complementary strand are represented similarly (with the beginning point greater than the ending position). Thus,
            NC_000913_50_1
    
    would represent the sequence on the complementary strand of NC_000913 going from position 50 to position 1.
  4. The routine dna_seq will accept a location (i.e., a comma-separated list of contiguous regions), or even just a list of arguments each representing a contiguous region, and return the concatenation of the sequences from those locations. Thus,
            $fig->dna_seq($genome,"NC_000913_1_50,NC_000913_52_100")
    or
            $fig->dna_seq($genome,"NC_000913_1_50","NC_000913_52_100")
    
    would return 99 characters formed by deleting character 51 from the first 100 characters of NC_000913.

Accessing Data Relating to Features

The SEED was designed, like WIT and ERGO before it, to allow a variety of types of features. In fact, the SEED is essentially focused on protein-encoding genes, which I perversely named pegs rather than CDSs. It is true that WIT contained RNAs, inteins, introns, and repeat sequences, and ERGO contained RNAs and IS elements; however, the SEED really only has pegs and RNAs at this point (and, the RNAs are not adequately handled).

There is a fairly rich set of services for pegs, as you might guess from the web services offered by the SEED. As an introduction to some of these capabilities, consider the following short program:

features_around.pl features_around.py

use strict;
my($usage,$id,$fid,$peg,$loc,$contig,$beg,$end,$start_reg,$end_reg);
my($loc1,$aliases1,$trunc,$pseq,$prot_ln,$func,$features_in_region);
my($start_of_leftmost_feature,$end_of_rightmost_feature);
my($genome);

use FIG;
my $fig = new FIG;

$usage = "usage: features_around ID";

($id = shift @ARGV)
    || die $usage;

if ($peg = $fig->by_alias($id))
{
    if ($loc = $fig->feature_location($peg))
    {
        ($contig,$beg,$end) = $fig->boundaries_of($loc);
        if (defined($contig))
        {
            $start_reg = &FIG::min($beg,$end) - 5000;
            $end_reg   = &FIG::max($beg,$end) + 5000;
            $genome = &FIG::genome_of($peg);
            ($features_in_region,$start_of_leftmost_feature,$end_of_rightmost_feature) =
                $fig->genes_in_region($genome,$contig,$start_reg,$end_reg);
            foreach $fid (@$features_in_region)
            {
                $loc1     = $fig->feature_location($fid);
                $aliases1 = $fig->feature_aliases($fid);
                $trunc    = $fig->possibly_truncated($fid);

                $pseq = $func = $prot_ln = "";

                if (&FIG::ftype($fid) eq "peg")
                {
                    if ($prot_ln = $fig->translation_length($fid))
                    {
                        $pseq = $fig->get_translation($fid);
                        $func = $fig->function_of($fid);
                    }
                }
                print join("\t",($fid,$loc1,$aliases1,$trunc,$func,$prot_ln,$pseq)),"\n";
            }
        }
    }
    else
    {
        print STDERR "Sorry, could not get the location of $fid\n";
    }
}
else
{
    print STDERR "Sorry, could not figure out which PEG you meant by $id\n";
}
from FigKernelPackages import FIG
import sys
import string

fig = FIG.FIG()

def die(msg):
	print msg
	sys.exit(0)

if len(sys.argv) != 2:
	die("usage: features_around ID")

id = sys.argv[1]

peg = fig.by_alias(id)
if peg:
    loc = fig.feature_location(peg)
    if loc:
	try:
	    (contig,beg,end) = fig.boundaries_of(loc)
            start_reg = int(fig.min(beg,end)) - 5000
            end_reg   = int(fig.max(beg,end)) + 5000
            genome = fig.genome_of(peg)
            (features_in_region,start_of_leftmost_feature,end_of_rightmost_feature) = fig.genes_in_region(genome,contig,start_reg,end_reg);
            for fid in features_in_region:
                loc1     = fig.feature_location(fid)
                aliases1 = fig.feature_aliases(fid)
                trunc    = fig.possibly_truncated(fid)

                pseq = func = prot_ln = ""

                if fig.ftype(fid) == "peg":
                    prot_ln = fig.translation_length(fid)
                    if prot_ln:
                        pseq = fig.get_translation(fid)
                        func = fig.function_of(fid)
		print string.join([fid, loc1, str(aliases1), str(trunc), str(func), str(prot_ln), pseq], "\t")
                #print join("\t",($fid,$loc1,$aliases1,$trunc,$func,$prot_ln,$pseq)),"\n";
	except:
		pass
    else:
	sys.stderr.write("Sorry, could not get the location of %s\n" % fid)
        #print STDERR "Sorry, could not get the location of $fid\n";
else:
    sys.stderr.write("Sorry, could not figure out which PEG you meant by %s\n" % id)
    #print STDERR "Sorry, could not figure out which PEG you meant by $id\n";

Displaying features close to a given feature


This short program takes as an argument an ID that corresponds to a feature. The ID can be a FIG feature ID of the form
        fig|xxxx.y.peg.zzzz
where xxxx.y is the genome ID and zzzz is a unique sequence number (these usually occur in the same order as the genes occur on the contigs, but this is not guaranteed to be true). The argument could also be any other ID known by the SEED that uniquely corresponds to the FIG ID. For example,
        features_around 'sp|Q9TL34'
would display twelve features (fig|31312.1.peg.1 through fig|31312.1.peg.12, since sp|Q9TL34 corresponds to fig|31312.1.peg.5. The program outputs a tab-separated file of the sort one might use as input to a spreadsheet like Excel. The columns in the output contain:
  1. the FIG ID,
  2. the location of the feature,
  3. a list of aliases for the gene,
  4. a flag indicating whether or not the gene occurs near the end of the contig (and, hence, is potentially truncated),
  5. the function currently assigned to the gene,
  6. the length of the protein sequence encoded by the gene, and
  7. the protein sequence encoded by the gene.
Of course, some of the features might not be PEGs, and in that case the last three fields are just empty.

Now, let's go through some of the details in the code:

The SEED comes with a directory containing the example programs we are displaying. If you find it hard to follow my comments or feel that I have failed left out some detail that you need to understand, please
  1. contact me (Ross@TheFIG.info), and
  2. copy the program and play with it. The most basic form of playing is to just do things like inserting
            print &Dumper($some_returned_variable);
    
    to see precisely what comes back.

Sequence Data

You have already been introduced to sequence data. You know how to access the DNA from a location in a contig from a designated genome, you know you to get locations of features, and you know how to acquire the translations for PEGs. In this short section we discuss just a few utilities that you might find useful.

The next short example illustrates how to access the reverse complement of a DNA sequence (i.e., the opposite strand), how to translate DNA sequences, and how to write out sequences in FASTA format. Let us look at it before thinking about additional complexities like how to handle nonstandard translations or properly translating the first codon:

seq_utils.pl seq_utils.py

use FIG;
my $fig = new FIG;

# seq_utils 31312.1 NC_000927_9431_9048 > stdout 2> stderr

$usage = "usage: seq_utils Genome Location";

(
 ($genome = shift @ARGV) &&
 ($loc  = shift @ARGV) 
)
    || die $usage;

if ($seq = $fig->dna_seq($genome,$loc))
{
    $seqR = &FIG::reverse_comp($seq);

    $tran  = &FIG::translate($seq);
    $tranR = &FIG::translate($seqR);

    &FIG::display_id_and_seq("plus_strand",\$seq);
    &FIG::display_id_and_seq("plus_strand_translated",\$tran,\*STDERR);

    &FIG::display_id_and_seq("minus_strand",\$seqR);
    &FIG::display_id_and_seq("minus_strand_translated",\$tranR,\*STDERR);
}

from FigKernelPackages import FIG
import sys

fig = FIG.FIG()

def die(msg):
	print msg
	sys.exit(0)

if len(sys.argv) != 3:
	die("usage: seq_utils Genome Location")

genome = sys.argv[1]
loc = sys.argv[2]

seq = fig.dna_seq(genome, loc)

if seq:
    seqR = fig.reverse_comp(seq);

    tran  = fig.translate(seq);
    tranR = fig.translate(seqR);

    fig.display_id_and_seq("plus_strand",seq);
    #fig.display_id_and_seq("plus_strand_translated",\$tran,\*STDERR);

    fig.display_id_and_seq("minus_strand",seqR);
    #fig.display_id_and_seq("minus_strand_translated",\$tranR,\*STDERR);

Using some simple sequence utilities


The program takes two command-line arguments: a genome ID and a location. The program is supposed to write out the DNA from both the plus and minus strands from the given location to STDOUT, and it writes the translations of both strands to STDERR. Thus,
        seq_utils 31312.1 NC_000927_9431_9048 > stdout 2> stderr
will write the following contents to the file stdout:
>plus_strand
ATGTCTACTATTACAAATCAAATTATTGAACAACTAAAATCGATGACGCTTTTGGAAGCG
GCTGAACTTGTGTCTCAAATTGAAGAGACATTTGGAGTTGATGCTTCTGCACCCGCAGCT
GGAGCGGTGATGGTCGCTGCCGCTGCGCCGTCTGCGCCAGTTGTCGAAGAGCAAACTGAA
TTTACTGTCATGATTAATGATGTGCCAAGTGCGAAACGAATTGCTGTTATTAAAGTTGTA
CGTGCTTTAACAGGATTAGGGTTGAAAGAAGCCAAAGATATGATTGAGTCTGTACCAAAA
GCAATTAAAGAAGGAATTGGGAAAGACGAAGCTCAACAAATTAAACAACAGCTCGAAGAG
GCTGGTGCAACTGTCACAGTGAAA
>minus_strand
TTTCACTGTGACAGTTGCACCAGCCTCTTCGAGCTGTTGTTTAATTTGTTGAGCTTCGTC
TTTCCCAATTCCTTCTTTAATTGCTTTTGGTACAGACTCAATCATATCTTTGGCTTCTTT
CAACCCTAATCCTGTTAAAGCACGTACAACTTTAATAACAGCAATTCGTTTCGCACTTGG
CACATCATTAATCATGACAGTAAATTCAGTTTGCTCTTCGACAACTGGCGCAGACGGCGC
AGCGGCAGCGACCATCACCGCTCCAGCTGCGGGTGCAGAAGCATCAACTCCAAATGTCTC
TTCAATTTGAGACACAAGTTCAGCCGCTTCCAAAAGCGTCATCGATTTTAGTTGTTCAAT
AATTTGATTTGTAATAGTAGACAT
and the following two sequences to the file stderr
>plus_strand_translated
MSTITNQIIEQLKSMTLLEAAELVSQIEETFGVDASAPAAGAVMVAAAAPSAPVVEEQTE
FTVMINDVPSAKRIAVIKVVRALTGLGLKEAKDMIESVPKAIKEGIGKDEAQQIKQQLEE
AGATVTVK
>minus_strand_translated
FHCDSCTSLFELLFNLLSFVFPNSFFNCFWYRLNHIFGFFQP*SC*STYNFNNSNSFRTW
HIINHDSKFSLLFDNWRRRRSGSDHHRSSCGCRSINSKCLFNLRHKFSRFQKRHRF*LFN
NLICNSRH
Now let us look at the routines within the program that produced these results: Since maps and subsystems are remarkably similar notions (both are sets of functional roles), you could also construct a similar program for subsystems. In the SEED, subsystems do not have IDs which are associated with full names; rather, the full name is itself the ID of the subsystem. With this in mind, the following program should be reasonably clear:

connect_subsystem_to_genes.pl connect_subsystem_to_genes.py

use FIG;
my $fig = new FIG;
use Sim;

use strict;
my($usage,$subsystem,$genome,$name,$role,$peg,$func);

$usage = "usage: connect_subsystem_to_genes Subsystem Genome";

(
 ($subsystem    = shift @ARGV) &&
 ($genome       = shift @ARGV)
)
    || die $usage;

print "$subsystem\n";
foreach $role ($fig->subsystem_to_roles($subsystem))
{
    print "  $role\n";
    foreach $peg ($fig->seqs_with_role($role,"master",$genome))
    {
	$func = $fig->function_of($peg);
	print "    $peg\t$func\n";
    }
}
from FigKernelPackages import FIG2
import sys

fig = FIG2.FIG()

def die(msg):
	print msg
	sys.exit(0)

if len(sys.argv) != 3:
	die("usage: connect_subsystem_to_genes Subsystem Genome")

subsystem  = sys.argv[1]
genome = sys.argv[2]

print subsystem
roles = fig.subsystem_to_roles(subsystem)
for role in roles:
    print role
    pegs = fig.seqs_with_role(role,"master",genome)
    for peg in pegs:
	func = fig.function_of(peg);
	print "%s\t%s" % ( peg, func[0][1])

Locating genes within an organism that connect to a specific subsystem


Invoking this program using
        connect_subsystem_to_genes 'Embden-Meyerhof and Gluconogenesis' 83333.1
would produce the following output:

Note that the functional roles in subsystems tend to be far more
detailed (and almost never just an EC number).

The problem with the above program is that we have chosen to view a subsystem as a spreadsheet in which each cell corresponds to a specific functional role in a designated organism. The gene functions of the PEGs contained in the cell usually connect to the specific role, but this is not forced. If someone changes the function of a gene, it does not disappear from the spreadsheet, unless the spreadsheet is rebuilt by forcing the entries to reflect current assignments and how they connect to the roles. We have chosen this approach to allow curators of spreadsheets to have precise control over the contents. People who import a subsystem may or may not instal the gene assignments needed to make the spreadsheet consistent, but the spreadsheet reflects the curators best assessment of which genes implement specific roles. This leads to my assertion that the following program is a better way to display a subsystem:

connect_subsystem_to_genes_the_right_way.pl connect_subsystem_to_genes_the_right_way.py

use FIG;
my $fig = new FIG;
use Sim;

use strict;
my($usage,$subsystem,$genome,$name,$role,$peg,$func);

$usage = "usage: connect_subsystem_to_genes_the_right_way Subsystem Genome";

(
  ($subsystem    = shift @ARGV) &&
  ($genome       = shift @ARGV)
 )
    || die $usage;

print "$subsystem\n";
foreach $role ($fig->subsystem_to_roles($subsystem))
{
    print "  $role\n";
    foreach $peg ($fig->pegs_in_subsystem_cell($subsystem, $genome, $role))
    {
	$func = $fig->function_of($peg);
	print "    $peg\t$func\n";
    }
}
from FigKernelPackages import FIG2
import sys

fig = FIG2.FIG()

def die(msg):
	print msg
	sys.exit(0)

if len(sys.argv) != 3:
	die("usage: connect_subsystem_to_genes_the_right_way Subsystem Genome")

subsystem = sys.argv[1]
genome  = sys.argv[2]

print subsystem
roles = fig.subsystem_to_roles(subsystem)

for role in roles:
    print role
    pegs = fig.pegs_in_subsystem_cell(subsystem, genome, role)
    for peg in pegs: 
	func = fig.function_of(peg)
	print "    %s\t%s" % (peg, func[0][1])

Locating genes within an organism that connect to a specific subsystem the right way


We have just illustrated how to connect maps to functional roles and functional roles to gene functions. Now we need to go the other way. That is, given a gene function, what roles would it connect to, and which maps and subsystems contain those roles? The following program illustrates the services required to answer this question;

gf2maps_and_subs.pl gf2maps_and_subs.py
use FIG;
my $fig = new FIG;

use strict;
my($usage,$peg,$func,@roles,$role,@maps,$map);
my(%in_map,@maps_it_connects_to,@subsystems_it_is_in);
my($map_name,$subsystem);

$usage = "usage: gf2maps_and_subs PEG";

# The single input argument must be a FIG id, which is used to get the assigned function.
# The program then prints a list of maps to which the gene can be connected,
# followed by a list of subsystems that contain the gene.

($peg = shift @ARGV)
    || die $usage;

if (($func = $fig->function_of($peg)) && (! &FIG::hypo($func)))
{
    @roles = $fig->roles_of_function($func);
    foreach $role (@roles)
    {
	@maps = $fig->role_to_maps($role);
	foreach $map (@maps)
	{
	    $in_map{$map} = 1;
	}
    }

    @maps_it_connects_to       = sort keys(%in_map);

    if (@maps_it_connects_to > 0)
    {
	print "Maps that connect to $func\n\n";
	foreach $map (@maps_it_connects_to)
	{
	    $map_name = $fig->map_name($map);
	    print "$map\t$map_name\n";
	}
	print "\n";
    }

}
else
{
    print STDERR "you need to give a valid PEG with a non-hypothetical function to connect to maps\n";
}

@subsystems_it_is_in = $fig->peg_to_subsystems($peg);

if (@subsystems_it_is_in > 0)
{
    print "Subsystems that contain $peg\n\n";
    foreach $subsystem (@subsystems_it_is_in)
    {
	print "$subsystem\n";
    }
}
from FigKernelPackages import FIG2
import sys

fig = FIG2.FIG()

def die(msg):
	print msg
	sys.exit(0)

if len(sys.argv) != 2:
	die("usage: gf2maps_and_subs PEG")

# The single input argument must be a FIG id, which is used to get the assigned function.
# The program then prints a list of maps to which the gene can be connected,
# followed by a list of subsystems that contain the gene.

peg = sys.argv[1]

func = fig.function_of(peg)
func = func[0][1]
hypo = fig.hypo(func)
in_map = {}

if func and hypo[0] == 0: 
    roles = fig.roles_of_function(func)
    for role in roles:
	maps = fig.role_to_maps(role)
	for map in maps:
	    in_map[map] = 1

    maps_it_connects_to = in_map.keys()
    maps_it_connects_to.sort()

    if len(maps_it_connects_to) > 0:
	print "Maps that connect to %s\n" % func
	for map in maps_it_connects_to:
	    map_name = fig.map_name(map)
	    print "%s\t%s" % (map, map_name[0])
	print " ";
else:
    sys.stderr.write("you need to give a valid PEG with a non-hypothetical function to connect to maps\n")

subsystems_it_is_in = fig.peg_to_subsystems(peg)

if len(subsystems_it_is_in) > 0:
    print "Subsystems that contain %s\n" % peg
    for subsystem in subsystems_it_is_in:
	print subsystem

Going from a PEG to the maps and subsystems containing it


Here, again, there are some of points worth discussing: If you take a few minutes out and think about maps and subsystems, you will see that they are very, very similar concepts that are treated quite differently by the SEED.

Annotations

Annotations are time-stamped text notes made by a named user and attached to specific features. The most common are notes recording when someone made an assignment of function to a PEG, but they can be used to record anything. The usual way to access the annotations attached to a given feature with an ID designated by $fid is to use
        @annotations = $fig->feature_annotations($fid)
The returned list contains 4-tuples (i.e., pointers to lists, each of which contains 4 entries). Each 4-tuple contains
  1. the FID (this is redundant, but it was convenient for reasons that will become apparent),
  2. a timestamp,
  3. the name of the user that made the annotation, and
  4. the annotation in the form of a text string.
The following short program uses
        @annotations = $fig->merged_related_annotations($fig_ids)
which is a useful generalization. This routine takes a pointer to a list of feature IDs as input, and it produces a merged list of the annotations attached to all of the features. This can often be very convenient. For example, you might have a set of similar PEGs that are all misannotated, and you wish to reconstruct the steps that led to the current situation. By merging the annotations (which normally record both who made the annotations and why), you can get a record of the events in chronilogical order. The following short example illustrates this utility:

annotations_of.pl annotations_of.py
use FIG;
my $fig = new FIG;
use strict;

my($usage,$fig_ids,@annotations);

$usage = "usage: annotations_of Fid1 Fid2 ...";

(@ARGV > 0)
    || die $usage;

$fig_ids = [@ARGV];

@annotations = $fig->merged_related_annotations($fig_ids);
if (@annotations > 0)
{
    &display_annotations(\@annotations);
}
else
{
    print STDERR "Sorry, there were no annotations attached to the features\n";
}

sub display_annotations {
    my($annotations) = @_;
    my($annotation,$fid,$timestamp,$user,$text);

    foreach $annotation (@$annotations)
    {
	($fid,$timestamp,$user,$text) = @$annotation;
	print "=======\n";
	print "$user\t$timestamp\t$fid\n$text\n";
	print "=======\n\n";
    }
}
from FigKernelPackages import FIG
import sys

def display_annotations(annotations):
    for annotation in annotations:
	fid,timestamp,user,text = annotation
	print "=======\n";
	print "%s\t%s\t%s\n%s\n" % ( user, timestamp, fid, text)
	print "=======\n\n"

fig = FIG.FIG()

def die(msg):
	print msg
	sys.exit(0)

if len(sys.argv) < 2:
	die("usage: annotations_of Fid1, Fid2, ... ")

fig_ids = sys.argv[1:]

annotations = fig.merged_related_annotations(fig_ids)
if len(annotations) > 0:
    display_annotations(annotations)
else:
    sys.stderr.write("Sorry, there were no annotations attached to the features\n")

Accessing the annotations from a set of features


Invoking this little utility using
        annotations_of 'fig|207559.1.peg.735' 'fig|12149.1.peg.2133' 'fig|1525.1.peg.1757' 
produced the following output:
=======
automated_assignment    Thu Mar 18 17:42:51 2004        fig|12149.1.peg.2133
Set master function to
ATP phosphoribosyltransferase (EC 2.4.2.1
=======

=======
automated_assignment    Thu Mar 18 17:45:16 2004        fig|1525.1.peg.1757
Set master function to
ATP phosphoribosyltransferase (EC 2.4.2.1
=======

=======
last_release    Fri Apr  2 14:14:07 2004        fig|207559.1.peg.735
Set master function to
ATP phosphoribosyltransferase (EC 2.4.2.1
=======

Assigning Gene Functions

We can now easily talk about assigning functions and making annotations to record the events. Here is a simple utility that can be used to make annotations:

batch_assignments.pl
use FIG;
my $fig = new FIG;

use strict;
my($usage,$user,$annotation,$text,$peg,$func,$conf,$user_name,$funcO);

$usage = "usage: batch_assignments User [AnnotationFile] < Assignments";

($user = shift @ARGV)
    || die $usage;

if ($annotation = shift @ARGV)
{
    $text = join("",`cat $annotation`);
}
else
{
    $text = "";
}

while (defined($_ = ))
{
    chop;
    ($peg,$func,$conf) = split(/\t/,$_);
    if (! $conf) { $conf = "" }
    if ($user =~ /master:(.*)/)
    {
	$user_name = $1;
	$funcO = $fig->function_of($peg);
	if ($funcO ne $func)
	{
	    $fig->assign_function($peg,"master",$func,$conf);
	    $fig->add_annotation($peg,$user_name,"Set master function to\n$func\n$text");
	}
    }
    else
    {
	$funcO = $fig->function_of($peg,$user);
	if ($funcO ne $func)
	{
	    $fig->assign_function($peg,$user,$func,$conf);
	    $fig->add_annotation($peg,$user,"Set function to\n$func\n$text");
	}
    }
}
    

A utility to make batch assignments


Before discussing the details of this code, we should discuss the issue of "master and non-master users". Each gene can have a single master assignment and any number of non-master assignments. A user who has signed in as master:USER (e.g., "master:RossO") is a master user. If a master user makes an assignment, it overwrites the master assignment. If a non-master user makes an assignment, it alters only the assignment corresponding to that user, if there is one. When a master asks for the function of a gene, he gets the master assignment. If a non-master asks for the function of a gene, he will get the assignment corresponding to his user id, if there is one; if not, he will get the master assignment, if there is one.

There is also the matter of confidence levels. The SEED supports attaching confidence codes to assignments, although they are seldom used. The codes we use at this point are: