[Bio] / FigTutorial / API.html Repository:
ViewVC logotype

View of /FigTutorial/API.html

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1.4 - (download) (as text) (annotate)
Sat Aug 23 23:39:22 2008 UTC (11 years, 2 months ago) by golsen
Branch: MAIN
CVS Tags: rast_rel_2014_0912, rast_rel_2010_0928, rast_2008_0924, rast_rel_2008_09_30, rast_rel_2010_0526, rast_rel_2014_0729, rast_rel_2009_05_18, rast_rel_2009_0925, rast_rel_2010_1206, rast_rel_2010_0118, rast_rel_2009_02_05, rast_rel_2011_0119, rast_rel_2008_12_18, rast_rel_2008_10_09, rast_release_2008_09_29, rast_rel_2009_07_09, rast_rel_2010_0827, myrast_33, rast_rel_2011_0928, rast_rel_2008_09_29, rast_rel_2008_10_29, rast_rel_2009_03_26, rast_rel_2008_11_24, HEAD
Changes since 1.3: +257 -153 lines
Move writing of assignment annotation from a case-by-case basis (and it
was missing in several key places) to the FIG::assign_function.

Modify the code in each of the calling locations to not make duplicate
annotations.

At the same time, remove (most) of the instances of making different
calls to assign_function depending on the user name.  assign_function
treats everyone as master (but writes an annotation with the real user
name).

<h1>Accessing SEED Services</h1>
<h2>Introduction</h2>
This document is designed to help a programmer understand how to
access the services provided by the SEED.  It is not designed for the
user who wishes to confine his experience to working with the SEED
browser over the Web.
<p>
Before we launch into a discussion of the details of how to invoke
specific routines, there are some background issues that you should
know about.  First, let's talk about <b>fig</b>, which is a
command-line utility that is used for a number of purposes.  Its most
basic purpose is to exercise the API and illustrate what data is
accessed by each operation.  However, you may also find it extremely
useful as a quick way to invoke services.  To try it out, type
<pre>
        fig
</pre>
This should produce a prompt of the form
<pre>
        ??
</pre>
You should type 
<pre>
        ?? h&lt;cr&gt;
</pre>
to get a listing of the commands
supported by the utility.  
At the time this document was written (May, 2004), the output was as follows:
<pre>
<b>?? h</b>
    DB                              Current DB
    abbrev                          genome_name
    access_sims                     [expand|func|expandfunc]       (for timeing - access 10 distinct sims)
    add_annotation                  FID User [ prompted for annotation; terminate with "." at start of line ]
    add_chromosomal_clusters        File
    add_genome                      GenomeDir
    add_pch_pins                    File
    add_peg_link                    PEG Link (e.g., &lt;a href=http//...&gt;link to somewhere&lt;/a&gt;)
    all_compounds               
    all_features                    GenomeID Type
    all_maps                 
    all_protein_families
    all_reactions
    all_roles
    all_sets                        Relation SetName
    assign_function                 PEG User [conf=X] Function
    assign_functionF                User File [no_annotations]
    assignments_made                who date G1 G2 ...  [ none => all ]
    auto_assign                     PEG [Seq]
    auto_assignF                    User FileOfPEGs
    auto_assignG                    User GenomeID1 GenomeID2 ...
    bbhs                            PEG Cutoff
    between                         n1 n2 n3
    blast                           PEG  [against nr]
    blastit                         PEG seq db maxP
    boundaries_of                   Loc
    build_tree_of_complete          min_for_label   
    by_alias                        alias
    by_fig_id                       FID1 FID2
    candidates_for_role             Genome Cutoff Role
    cas                             CID
    cas_to_cid                      cas
    catalyzed_by                    RID
    catalyzes                       role
    cgi_url
    clean_tmp
    close_genes                     FID distance
    comp2react                      CID
    contig_ln                       GenomeID Contig
    coupling_and_evidence           FID Bound SimCutoff CouplingCutoff
    crude_estimate_of_distance      GenomeID1 GenomeID2
    delete_all_peg_links            PEG
    delete_genomes                  G1 G2 G3 ...Gn
    delete_peg_link                 PEG URL
    displayable_reaction            RID
    dna_seq                         GenomeID Loc
    dsims                           FastaFile [MaxN MaxPsc Select]   *** select defaults to raw
    ec_to_maps                      EC
    ec_name                         EC
    expand_ec                       EC
    epoch_to_readable               [gives readable version of current time]
    export_chromosomal_clusters
    export_pch_pins
    export_set                      Relation SetName File
    exportable_subsystem            Subsystem
    external_calls                  PEG1 PEG2 PEG3...
    extract_seq                     ContigsFile loc 
    all_exchangable_subsystems
    best_bbh_candidates             Genome Cutoff Requested Known
    family_function                 Family
    fast_coupling                   FID Bound CouplingCutoff
    feature_aliases                 FID
    feature_annotations             FID
    feature_location                FID 
    file2N                          File
    ftype                           FID 
    function_of                     ID [all] [trans]
    genes_in_region                 GenomeID Contig Beg End 
    genome_of                       PEG
    genomes                         [complete|Pat]
    genome_counts                   [complete]
    genome_version                  GenomeID
    genus_species                   GenomeID
    get_translation                 ID
    get_translations                File [tab]
    h                               [command]  *** h h for help on how to use help ***
    hypo                            Function
    ids_in_family                   Family
    ids_in_set                      WhichSet Relation SetName
    in_cluster_with                 PEG
    in_family                       FID
    in_pch_pin_with                 FID
    in_sets                         Which Relation SetName
    is_archaeal                     GenomeID
    is_bacterial                    GenomeID
    is_eukaryotic                   GenomeID
    is_prokaryotic                  GenomeID
    is_exchangable_subsystem        Subsystem
    is_real_feature                 FID   
    largest_clusters                FileOfRoles user 
    load_all
    map_to_ecs                      map
    map_name                        map
    mapped_prot_ids                 ID
    maps_to_id                      ID 
    max                             n1 n2 n3 ...
    merged_related_annotations      PEG1 PEG2 ... PEGn
    min                             n1 n2 n3 ...
    names_of_compound
    neighborhood_of_role            role        
    org_of                          prot_id
    pegs_of                         GenomeID
    possibly_truncated              FID
    reaction2comp                   RID
    related_by_func_sim             PEG user
    reversible                      RID  
    rnas_of                         GenomeID
    roles_of_function               function
    same_func                       'function1'   'function2'
    search_index                    pattern
    seqs_with_role                  role who
    seqs_with_roles_in_genomes      genomes roles made_by
    sims                            PEG maxN maxP select
    sort_fids_by_taxonomy           FID1 FID2 FID3 ...
    sort_genomes_by_taxonomy        GenomeID1 GenomeID2 GenomeID3 ...
    sz_family                       family
    taxonomic_groups_of_complete    min_for_labels
    taxonomy_of                     GenomeID
    translatable                    prot_id
    translate_function              PEG user 
    translated_function_of          PEG user
    translation_length              ID
    unique_functions                PEGs user  [make the pegs comma separated]
    verify_dir                      dir
    
</pre>
If you quickly browse through the commands that are supported by <b>fig</b>,
you should get a fairly complete overview of the services provided by the API.  
We are not going to cover the use of <b>fig</b> in detail in this
tutorial, but we do encourage you to experiment with it.  By the way,
you should type <b>x</b> to exit from it.
<p>
<b>fig</b> is ususally accessed in interactive mode (invoked with no
arguments).  However, if you wish to invoke it to process a single
command, you can just put the command as a single argument on the
command-line.  Thus,
<pre>
  fig 'add_peg_link fig|188.1.peg.96 &lt;a href=http://nature.berkeley.edu/~hongwang/Project/ATP_synthase/&gt;animation&lt;/a&gt;'
</pre>
would add a link to PEG <i>fig|188.1.peg.96</i> while
<pre>
  fig 'delete_all_peg_links fig|188.1.peg.96'
</pre>
would delete all links from the PEG.  Since adding links to PEGs is
something you might wish to do, the easiest way to do it is to just
create a file of commands of the sort shown, make the file executable,
and then run it.
<p>
If you are a Perl programmer, eventually you will wish to peruse the
<b>fig</b> source code to see how it is structured, how to add to it,
and how to invoke services via the API.
<p>
We have  briefly discussed the utility <b>fig</b>.  Now it is time to
discuss writing your own programs using the API.  Basically, the
entire API is encapsulated in the Perl module <b>FIG.pm</b>.  To illustrate
the style of programming we recommend, let's create a simple program
to add a link to a PEG (rather than using <b>fig</b>).  
Adding a link
to a PEG will cause the link to be displayed whenever anyone examines
the PEG via the SEED web services..  
The API defines a set of services that do many things; adding a link
to a PEG is just one very minor service, but understanding exactly how one
could invoke such a service is what this tutorial is about.
To begin, use
<pre>
        tool_hdr add_link
</pre>
This creates a file called <i>add_link</i> and makes it executable.
The file contains an initial set of Perl commands designed to
establish the correct execution environment (where the libraries are,
and so forth).  You need to add your code to the end of the
initialized file.  We recommend adding the following lines:
<hr>
<pre>
use FIG;
my $fig = new FIG;

$usage = "usage: add_link PEG Link";

(
 ($peg  = shift @ARGV) &&
 ($link = shift @ARGV)
)
    || die $usage;

$fig->add_peg_links([$peg,$link]);
</pre>
<h3>A short program to add a link to a PEG</h3>
<hr>
We urge you to study this short program carefully, since it reflects
the style we will use throughout this tutorial.  There are a number of
things worth noting:
<ul>
<li>
The lines
<pre>
use FIG;
my $fig = new FIG;
</pre>
initialize <i>$fig</i>.  This is your key to the fig services.  All
API services are invoked using
<pre>
        $fig->rtn(arguments)
</pre>
<li>
The last line of the program is the actual invocation of the service
to add a link to a peg.  Actually, <i>add_peg_links</i> is a service
routine that takes any number of arguments.  Each argument is a
2-tuple (i.e., a pointer to a list containing two elements).  Each
2-tuple contains a PEG and a link.  Links should be of the form:
<pre>
        &lt;a href=URL&gt;LABEL&lt;/a&gt;       
</pre>
</ul>

We suggest that you type in this program and get it to run on your
version of the SEED.  Then, use <b>fig</b> to delete any links you
added. 

<h2>Accessing Data Relating to Genomes</h2>

Let us now begin a somewhat more systematic exploration of the services offered by the API.
We will begin with genome-related services, of which there are relatively few in the SEED.
Essentially, genomes are represented by IDs of the form <i>xxx.y</i>
where <i>xxx</i> is the NCBI taxon ID, and <i>y</i> is a suffix to
disambiguate situations in which you have multiple genomes with the
same taxon ID.  
<p>
<blockquote>
<table border="1">
<tr>
<td>genomes.ex.pl</td>
<td>genomes.ex.py</td>
</tr>
<tr>
<td bgcolor="f0f0f0"><pre>

use strict;
use FIG;
my $fig = new FIG;

my(%sofar,$genome,$which,$abbrev,$version);

foreach $genome ($fig->genomes("complete"))
{
    $which   = $fig->genus_species($genome);
    $abbrev  = &get_unique_abbreviation($which,\%sofar);
    $version = $fig->genome_version($genome);
    print "$genome\t$abbrev\t$version\t$which\n";
}

sub get_unique_abbreviation {
    my($which,$sofar) = @_;
    my($nonunique,$n);

    $nonunique = &FIG::abbrev($which);
    $nonunique =~ s/ //g;
    $nonunique =~ s/\.+$//;
    $n = $sofar->{$nonunique};
    if (! defined($n))
    {
        $n = $sofar->{$nonunique} = 1;
    }
    $sofar->{$nonunique}++;
    return $nonunique . ".$n";
}
</pre></td>
<td bgcolor="f0f0f0"><pre>
from FigKernelPackages import FIG

import sys

import string





def die(msg):

	print msg

	sys.exit(0)



fig = FIG.FIG()

sofar = {}



def get_unique_abbreviation(which, sofar):

    nonunique = fig.abbrev(which);

    nonunique = string.rstrip(string.replace(nonunique," ",  ""), ".")

    if not sofar.has_key(nonunique):

	sofar[nonunique] = 1

    n = sofar[nonunique]

    sofar[nonunique] = sofar[nonunique] + 1

    return (nonunique+"."+str(n))



for genome in fig.genomes("complete"):

    which   = fig.genus_species(genome)

    abbrev  = get_unique_abbreviation(which,sofar);

    version = fig.genome_version(genome);

    print "%s\t%s\t%s\t%s" % (genome, abbrev, version, which)



</pre></td>
</tr>
</table>
</blockquote>

<h3>An illustration of genome-related services</h3>

<hr>
There are a few points worth noting about the short example:
<ul>
<li>

<i>$fig->genomes</i> returns a list of all of the genome IDs.  It has
two arguments that can be omitted.  The first is used to request only
complete (or near complete) genomes (this is used in the example).
For each genome in the SEED there is a directory in
<i>$FIG_Config::organisms</i> that encapsulates data about the genome. 
Genomes will be considered "complete" iff there exists a file
<i>$FIG_Config::organisms/$genome/COMPLETE</i> (where $genome contains the
genome ID).  The second argument can be used to restrict the returned
list of genomes to those that have restrictions on distribution
(indicated by the presence of
<i>$FIG_Config::organisms/$genome/RESTRICTIONS</i>). Thus,
<i>$fig->genomes("complete","restricted")</i> could be used to get a
list of complete, but restricted, genomes.
<li>
There are a number of routines contained in <i>FIG.pm</i> which are
not really services, but are there because they are simply offer
convenient subroutines.  They probably should not be part of an API.
For example <i>&FIG::between($x,$y,$z)</i> can be invoked to return
"true" iff <i>$y</i> is between <i>$x</i> and <i>$z</i>.
<i>&FIG::abbrev($which)</i> invokes a simple routine that produces an
abbrevbiation of a genome description.  It will usually contain
spaces and sometimes trailing periods, so if you want an abbreviation without these, you have to get
rid of them.  Further, the abbreviations are not unique (that is,
invoking this routine for two distinct genomes can produce the same
result).  Hence, we give a simple little routine that produces unique
abbreviations (suitable, for example, to use for sequences in an
alignment). 
<li>
Finally, it is worth noting that, as of now, we think of a genome
"version" as composed of a genome ID concatenated with a checksum.
This may or may not change in the future.
</ul>
Here are just a few lines from the output of this little program:
<pre>
197221.1        The.elon.BP.1   197221.1.3406724106     Thermosynechococcus elongatus BP-1
198094.1        Bac.anth.st.2   198094.1.2593352402     Bacillus anthracis str. Ames
198214.1        Shi.flex.2a.1   198214.1.1464268764     Shigella flexneri 2a str. 301
198215.1        Shi.flex.2a.2   198215.1.1756034760     Shigella flexneri 2a str. 2457T
</pre>
What other genome-related services does the SEED offer?  The only ones
worth mentioning here relate to <i>contigs</i>.  Here is a short
example illustrating how to use those services.  Basically, you can
<ol>
<li>
get a list of the contigs in a given genome,
<li>
get the length of a specified contig, and
<li>
extract sequence from a contig.
</ol>
To illustrate these features, here is a short program that list all of
the contigs for a designated genome.  For each contig that is over 50
characters in size (and, they all should be), it displays the contig
ID, the contig length, and the first 50 characters of sequence.
<p>
<blockquote>
<table border="1">
<tr>
<td>contigs_for_genome.pl</td>
<td>contigs_for_genome.py</td>
</tr>
<tr>
<td bgcolor="f0f0f0"><pre>

use strict;
use FIG;
my $fig = new FIG;

my $usage = "usage: contigs_for_genome Genome";

my($genome,@contigs,$contig,$len,$seq);

($genome  = shift @ARGV) 
    || die $usage;

@contigs = $fig->all_contigs($genome);
foreach $contig (@contigs)
{
    $len = $fig->contig_ln($genome,$contig);
    if ($len >= 50)
    {
        $seq = $fig->dna_seq($genome,$contig . "_1_50");
        print "$contig\t$len\t$seq\n";
    }
}

</pre></td>
<td bgcolor="f0f0f0"><pre>
from FigKernelPackages import FIG

import sys



fig = FIG.FIG()



def die(msg):

	print msg

	sys.exit(0)



if len(sys.argv) != 2:

	die("usage: contigs_for_genome Genome")



genome = sys.argv[1]



contigs =  fig.all_contigs(genome)



for contig in contigs:

    len = fig.contig_ln(genome,contig)

    if len >= 50:

        seq = fig.dna_seq(genome,contig+"_1_50")

        print contig+"\t"+len+"\t"+seq



</pre></td>
</tr>
</table>
</blockquote>

<h3>How to access services related to contigs</h3>
<hr>
The services being used in this short example are as follows:
<ul>
<li>
<i>$fig->all_contigs($genome)</i> returns a list of the contigs that
make up the designated genome,
<li>
<i>$fig->contig_ln($genome,$contig)</i> returns the length of the
designated contig, and
<li>
<i>$fig->dna_seq($genome,$contig . "_1_50")</i> returns the first 50
characters of the contig.
</ul>
This last service (<i>dna_seq</i>) requires some explanation.  You
need to know that
<ol>
<li>
Locations on a contig begin with 1 (not 0 as some computer scientists
would prefer).
<li>
Within the SEED, locations are represented as a comma-separated list
of <i>contiguous regions</i>.  A contiguous region is represented by a
contig ID, a beginning
position, and an ending position all separated by underscores.  One
fact that can be a little confusing is that contig IDs themselves may
contain underscores. Thus,
<pre>
        NC_000913_1_50
</pre>
would be used to represent the first fifty chracters of the contig
NC_000913.
<li>
Sections of sequence on the complementary strand are represented
similarly (with the beginning point greater than the ending
position).  Thus,
<pre>
        NC_000913_50_1
</pre>
would represent the sequence on the complementary strand of NC_000913
going from position 50 to position 1.
<li>
The routine <i>dna_seq</i> will accept a location (i.e., a
comma-separated list of contiguous regions), or even just a list of
arguments each representing a contiguous region, and return
the concatenation of the sequences from those locations.  Thus,
<pre>
        $fig->dna_seq($genome,"NC_000913_1_50,NC_000913_52_100")
or
        $fig->dna_seq($genome,"NC_000913_1_50","NC_000913_52_100")
</pre>
would return 99 characters formed by deleting character 51 from the
first 100 characters of NC_000913.
</ol>

<h2>Accessing Data Relating to Features</h2>

The SEED was designed, like WIT and ERGO before it, to allow a variety
of types of features.  In fact, the SEED is essentially focused on
protein-encoding genes, which I perversely named <i>pegs</i> rather
than <i>CDSs</i>.  It is true that WIT contained RNAs, inteins,
introns, and repeat sequences, and ERGO contained RNAs and IS
elements; however, the SEED really only has pegs and RNAs at this
point (and, the RNAs are not adequately handled).
<p>
There is a fairly rich set of services for pegs, as you might guess
from the web services offered by the SEED.  As an introduction to some
of these capabilities, consider the following short program:

<p>
<blockquote>
<table border="1">
<tr>
<td>features_around.pl</td>
<td>features_around.py</td>
</tr>
<tr>
<td bgcolor="f0f0f0"><pre>

use strict;
my($usage,$id,$fid,$peg,$loc,$contig,$beg,$end,$start_reg,$end_reg);
my($loc1,$aliases1,$trunc,$pseq,$prot_ln,$func,$features_in_region);
my($start_of_leftmost_feature,$end_of_rightmost_feature);
my($genome);

use FIG;
my $fig = new FIG;

$usage = "usage: features_around ID";

($id = shift @ARGV)
    || die $usage;

if ($peg = $fig->by_alias($id))
{
    if ($loc = $fig->feature_location($peg))
    {
        ($contig,$beg,$end) = $fig->boundaries_of($loc);
        if (defined($contig))
        {
            $start_reg = &FIG::min($beg,$end) - 5000;
            $end_reg   = &FIG::max($beg,$end) + 5000;
            $genome = &FIG::genome_of($peg);
            ($features_in_region,$start_of_leftmost_feature,$end_of_rightmost_feature) =
                $fig->genes_in_region($genome,$contig,$start_reg,$end_reg);
            foreach $fid (@$features_in_region)
            {
                $loc1     = $fig->feature_location($fid);
                $aliases1 = $fig->feature_aliases($fid);
                $trunc    = $fig->possibly_truncated($fid);

                $pseq = $func = $prot_ln = "";

                if (&FIG::ftype($fid) eq "peg")
                {
                    if ($prot_ln = $fig->translation_length($fid))
                    {
                        $pseq = $fig->get_translation($fid);
                        $func = $fig->function_of($fid);
                    }
                }
                print join("\t",($fid,$loc1,$aliases1,$trunc,$func,$prot_ln,$pseq)),"\n";
            }
        }
    }
    else
    {
        print STDERR "Sorry, could not get the location of $fid\n";
    }
}
else
{
    print STDERR "Sorry, could not figure out which PEG you meant by $id\n";
}
</pre></td>
<td bgcolor="f0f0f0"><pre>
from FigKernelPackages import FIG

import sys

import string



fig = FIG.FIG()



def die(msg):

	print msg

	sys.exit(0)



if len(sys.argv) != 2:

	die("usage: features_around ID")



id = sys.argv[1]



peg = fig.by_alias(id)

if peg:

    loc = fig.feature_location(peg)

    if loc:

	try:

	    (contig,beg,end) = fig.boundaries_of(loc)

            start_reg = int(fig.min(beg,end)) - 5000

            end_reg   = int(fig.max(beg,end)) + 5000

            genome = fig.genome_of(peg)

            (features_in_region,start_of_leftmost_feature,end_of_rightmost_feature) = fig.genes_in_region(genome,contig,start_reg,end_reg);

            for fid in features_in_region:

                loc1     = fig.feature_location(fid)

                aliases1 = fig.feature_aliases(fid)

                trunc    = fig.possibly_truncated(fid)



                pseq = func = prot_ln = ""



                if fig.ftype(fid) == "peg":

                    prot_ln = fig.translation_length(fid)

                    if prot_ln:

                        pseq = fig.get_translation(fid)

                        func = fig.function_of(fid)

		print string.join([fid, loc1, str(aliases1), str(trunc), str(func), str(prot_ln), pseq], "\t")

                #print join("\t",($fid,$loc1,$aliases1,$trunc,$func,$prot_ln,$pseq)),"\n";

	except:

		pass

    else:

	sys.stderr.write("Sorry, could not get the location of %s\n" % fid)

        #print STDERR "Sorry, could not get the location of $fid\n";

else:

    sys.stderr.write("Sorry, could not figure out which PEG you meant by %s\n" % id)

    #print STDERR "Sorry, could not figure out which PEG you meant by $id\n";

</pre></td>
</tr>
</table>
</blockquote>
<h3>Displaying features close to a given feature</h3>
<hr>
This short program takes as an argument an ID that corresponds to a
feature.  The ID can be a FIG feature ID of the form
<pre>
        fig|xxxx.y.peg.zzzz
</pre>
where <i>xxxx.y</i> is the genome ID and <i>zzzz</i> is a unique
sequence number (these usually occur in the same order as the genes
occur on the contigs, but this is not guaranteed to be true).  The
argument could also be any other ID known by the SEED that uniquely
corresponds to the FIG ID.  For example,
<pre>
        features_around 'sp|Q9TL34'
</pre>

would display twelve features (<i>fig|31312.1.peg.1</i> through
<i>fig|31312.1.peg.12</i>, since <i>sp|Q9TL34</i> corresponds to
<i>fig|31312.1.peg.5</i>.  The program outputs a tab-separated file
of the sort one might use as input to a spreadsheet like Excel.  The
columns in the output contain:

<ol>
<li>
the FIG ID,
<li>
the location of the feature,
<li>
a list of aliases for the gene,
<li>
a flag indicating whether or not the gene occurs near the end of the
contig (and, hence, is potentially truncated), 
<li>
the function currently assigned to the gene,
<li>
the length of the protein sequence encoded by the gene, and
<li>
the protein sequence encoded by the gene.
</ol>
Of course, some of the features might not be PEGs, and in that case
the last three fields are just empty.
<p>
Now, let's go through some of the details in the code:
<ul>
<li>
The programs begins by trying to locate the FIG ID corresponding to
the ID given as the command-line argument.  The expression
<pre>
        $fig->by_alias($id)
</pre>
returns the FIG ID when it can be determined (otherwise,
<i>undef</i>).
<li>
Once the FIG ID for the PEG has been determined, the location is
requested using
<pre>
        $fig->feature_location($peg)
</pre>
Remember, a location is a sequence of contiguous regions separated by
commas (often a single contiguous region); a contiguous region
amounts to a contig ID, a beginning position, and an ending position
separated by underscores.  In some contexts, all you want is the
region containing a gene (without the details of intron/exon
boundaries).  In this case, you can use
<pre>
        ($contig,$beg,$end) = $fig->boundaries_of($loc);
</pre>
to convert the location to a single contiguous region.  These will
succeed only if the sequence of contiguous regions all have the same
contig ID.  <i>$beg</i> will be set to the start of the first exon,
while <i>$end</i> will be set to the end of the last exon.
<li>
Once the boundaries of the gene have been determined, the program asks
for all genes that overlap a region starting 5000 characters (bases,
if you like) to the left of the gene and ending 5000 characters to the
right of the gene.  This is done by using
<pre>
        $genome = &FIG::genome_of($peg);
</pre
to get the genome of the feature (<i>$fig->genome_of($peg)</i> would
work, as well), and then using
<pre>
            ($features_in_region,$start_of_leftmost_feature,$end_of_rightmost_feature) =
                $fig->genes_in_region($genome,$contig,$start_reg,$end_reg);
</pre>
to extract the set of genes that overlap the region specified by the
four arguments <i>$genome, $contig, $start_reg, </i>and
<i>$end_reg</i>.  A list of three things is returned by the
invocation:
<ol>
<li>
<i>$features_in_region</i> will be set to a pointer to a list of the
features (represented by FIG IDs) that overlap the specified region.
<li>
<i>$start_of_leftmost_feature</i> will be set to the "leftmost"
position occupied by one of the features that is returned.

<i>$end_of_rightmost_feature</i> will be set to the "rightmost"
position occupied by one of the features that is returned.
</ol>
Thus, if you were writing a graphics module to display the features,
you would need to show the region from
<i>$start_of_leftmost_feature</i> to <i>$end_of_rightmost_feature</i>
if you wished to capture the features without truncation.
<li>
Once the features have been acquired, the program has a loop that
simply extracts data about each feature and prints it.
<pre>
                $loc1     = $fig->feature_location($fid);
                $aliases1 = $fig->feature_aliases($fid);
                $trunc    = $fig->possibly_truncated($fid);
</pre>
shows how to get the location of a feature, a comma-separated list of
aliases, and an indication of whether or not the feature might be
truncated (this is a rather simple check to just see if the feature
occurs within a few hundred bases from the end of the contig).
The use of 
<pre>
        &FIG::ftype($fid)
</pre>
just extracts the type of the feature (which is almost always
<i>peg</i> or <i>rna</i>, although we certainly hope that we will have
a richer set of feature types in the not too distant future).
The two invocations
<pre>
        $prot_ln = $fig->translation_length($fid)
        $pseq    = $fig->get_translation($fid)
</pre>
show how to get the length of the protein sequence encoded by a PEG
and the actual sequence itself.
Finally,
<pre>
        $func = $fig->function_of($fid)
</pre>
is used to get the current master function assignment for the gene.
The SEED does support non-master assignments, although their use has
been discouraged.  the original intent was to allow inexperienced
users to make assignments that did not overwrite the "master"
assignment.  Users who identify themselves as <i>master:someone</i>
will only make assignments that overwrite the master assignment.
However, a user without the "master:" prefix can make an assignment
that is retained seprately.  To access such an assignment, you would
use
<pre>
        $func = $fig->function_of($fid,"someone")
</pre>
If <i>someone</i> has not made an assignment to <i>$fig</i>, this will
return the master assignment.
</ul>

The SEED comes with a directory containing the example programs we are
displaying.  If you find it hard to follow my comments or feel that I
have failed left out some detail that you need to understand, please
<ol>
<li>
contact me (Ross@TheFIG.info), and
<li>
copy the program and play with it.  The most basic form of playing is
to just do things like inserting
<pre>
        print &Dumper($some_returned_variable);
</pre>
to see precisely what comes back.
</ol>

<h2>Sequence Data</h2>

You have already been introduced to sequence data.  You know how to
access the DNA from a location in a contig from a designated genome,
you know you to get locations of features, and you know how to acquire
the translations for PEGs.  In this short section we discuss just a
few utilities that you might find useful.
<p>
The next short example illustrates how to access the reverse
complement of a DNA sequence (i.e., the opposite strand), how to
translate DNA sequences, and how to write out sequences in FASTA
format.
Let us look at it before thinking about additional complexities like
how to handle nonstandard translations or properly translating the
first codon:

<p>
<blockquote>
<table border="1">
<tr>
<td>seq_utils.pl</td>
<td>seq_utils.py</td>
</tr>
<tr>
<td bgcolor="f0f0f0"><pre>

use FIG;
my $fig = new FIG;

# seq_utils 31312.1 NC_000927_9431_9048 > stdout 2> stderr

$usage = "usage: seq_utils Genome Location";

(
 ($genome = shift @ARGV) &&
 ($loc  = shift @ARGV) 
)
    || die $usage;

if ($seq = $fig->dna_seq($genome,$loc))
{
    $seqR = &FIG::reverse_comp($seq);

    $tran  = &FIG::translate($seq);
    $tranR = &FIG::translate($seqR);

    &FIG::display_id_and_seq("plus_strand",\$seq);
    &FIG::display_id_and_seq("plus_strand_translated",\$tran,\*STDERR);

    &FIG::display_id_and_seq("minus_strand",\$seqR);
    &FIG::display_id_and_seq("minus_strand_translated",\$tranR,\*STDERR);
}

</pre></td>
<td bgcolor="f0f0f0"><pre>
from FigKernelPackages import FIG

import sys



fig = FIG.FIG()



def die(msg):

	print msg

	sys.exit(0)



if len(sys.argv) != 3:

	die("usage: seq_utils Genome Location")



genome = sys.argv[1]

loc = sys.argv[2]



seq = fig.dna_seq(genome, loc)



if seq:

    seqR = fig.reverse_comp(seq);



    tran  = fig.translate(seq);

    tranR = fig.translate(seqR);



    fig.display_id_and_seq("plus_strand",seq);

    #fig.display_id_and_seq("plus_strand_translated",\$tran,\*STDERR);



    fig.display_id_and_seq("minus_strand",seqR);

    #fig.display_id_and_seq("minus_strand_translated",\$tranR,\*STDERR);

</pre></td>
</tr>
</table>
</blockquote>

<h3>Using some simple sequence utilities</h3>
<hr>
The program takes two command-line arguments: a
genome ID and a location.  
The program is supposed to write out the DNA from both the plus and
minus strands from the given location to STDOUT, and it writes the
translations of both strands to STDERR.
Thus,
<pre>
        seq_utils 31312.1 NC_000927_9431_9048 > stdout 2> stderr
</pre>
will write the following contents to the file <i>stdout</i>:
<pre>
>plus_strand
ATGTCTACTATTACAAATCAAATTATTGAACAACTAAAATCGATGACGCTTTTGGAAGCG
GCTGAACTTGTGTCTCAAATTGAAGAGACATTTGGAGTTGATGCTTCTGCACCCGCAGCT
GGAGCGGTGATGGTCGCTGCCGCTGCGCCGTCTGCGCCAGTTGTCGAAGAGCAAACTGAA
TTTACTGTCATGATTAATGATGTGCCAAGTGCGAAACGAATTGCTGTTATTAAAGTTGTA
CGTGCTTTAACAGGATTAGGGTTGAAAGAAGCCAAAGATATGATTGAGTCTGTACCAAAA
GCAATTAAAGAAGGAATTGGGAAAGACGAAGCTCAACAAATTAAACAACAGCTCGAAGAG
GCTGGTGCAACTGTCACAGTGAAA
>minus_strand
TTTCACTGTGACAGTTGCACCAGCCTCTTCGAGCTGTTGTTTAATTTGTTGAGCTTCGTC
TTTCCCAATTCCTTCTTTAATTGCTTTTGGTACAGACTCAATCATATCTTTGGCTTCTTT
CAACCCTAATCCTGTTAAAGCACGTACAACTTTAATAACAGCAATTCGTTTCGCACTTGG
CACATCATTAATCATGACAGTAAATTCAGTTTGCTCTTCGACAACTGGCGCAGACGGCGC
AGCGGCAGCGACCATCACCGCTCCAGCTGCGGGTGCAGAAGCATCAACTCCAAATGTCTC
TTCAATTTGAGACACAAGTTCAGCCGCTTCCAAAAGCGTCATCGATTTTAGTTGTTCAAT
AATTTGATTTGTAATAGTAGACAT
</pre>
and the following two sequences to the file <i>stderr</i>
<pre>
>plus_strand_translated
MSTITNQIIEQLKSMTLLEAAELVSQIEETFGVDASAPAAGAVMVAAAAPSAPVVEEQTE
FTVMINDVPSAKRIAVIKVVRALTGLGLKEAKDMIESVPKAIKEGIGKDEAQQIKQQLEE
AGATVTVK
>minus_strand_translated
FHCDSCTSLFELLFNLLSFVFPNSFFNCFWYRLNHIFGFFQP*SC*STYNFNNSNSFRTW
HIINHDSKFSLLFDNWRRRRSGSDHHRSSCGCRSINSKCLFNLRHKFSRFQKRHRF*LFN
NLICNSRH
</pre>
Now let us look at the routines within the program that produced these
results:
<ul>
<li>
You have already seen the use of something like
<pre>
        $seq = $fig->dna_seq($genome,$loc)
</pre>
to extract the DNA.  The actual location can specify sequence from
either strand (remember how?).  
<li>
The we use
<pre>
        $seqR = &FIG::reverse_comp($seq)
</pre>
to get the reverse complement of the sequence held in <i>$seq</i>.
<li>
The program uses
<pre>
    $tran  = &FIG::translate($seq);
    $tranR = &FIG::translate($seqR);
</pre>
to get translations of both strands.  Note that these invocations use
the standard genetic code to do the translation.  The <i>translate</i>
service supports two optional arguments.  The second argument is used
to pass a pointer to a hash containing the desired translation, if you wish to use
a nonstandard code.  The keys of the hash should be

<pre>
        AAA => 
        AAC =>
        AAG =>
        .
        .
        .
        TTT =>
</pre>
The values are what the codons are translated to (normally, from the 
set {A,C,G,T}, but it is essentially unrestricted).  If the third
argument evaluates to <i>true</i> (i.e., defined, not null string, and
not 0), then initial GTG and TTG codons will be translated to "M".
<li>
The final little utility we employ is used to write the FASTA files:
<pre>
    &FIG::display_id_and_seq("plus_strand",\$seq);
    &FIG::display_id_and_seq("plus_strand_translated",\$tran,\*STDERR);

    &FIG::display_id_and_seq("minus_strand",\$seqR);
    &FIG::display_id_and_seq("minus_strand_translated",\$tranR,\*STDERR);
</pre>
The first argument is the <i>id</i> you wish, the second is a pointer
to the sequence to write out, and the third argument (which is
optional) is a file handle.  It defaults to writing to STDOUT.

<h2>Accessing Similarities</h2>

A great deal of comparative analysis relates to use of detected
similarities.  The SEED includes precomputed similarities for each
protein sequence.  Basically these similarities are computed as
follows:
<ol>
<li>
First, a nonredundant protein sequence data base is formed by taking
all protein sequences from a number of sources, including the
organisms loaded into the SEED (but, including sequences from external
sources that are not found in the SEED organisms).  Sets of protein
sequences that differ only in start calls are collapsed into single
entries in the nonredundant database (the longer sequences are
retained, and a table of synonyms is constructed showing which IDs
correspond to each nonredundant entry, and which of the merged IDs
actually represent shorter sequences.
<li>
Then all of the sequences in the nonredundant database are blasted
against one another.  For each sequence, there may be only a few or a
great many similarities.  We keep only the best similarity between any
two sequences.  Further, we set a maximum number of similarities
retained for each sequence to "about 300".  We try to keep a set that
includes at least the best similarity between the given sequence and
sequences from any specific organism.  We do not guarantee that we even
succeed in that.  So, you should think of the precomputed similarities
as a potentially useful, but quite truncated, set.  We are moving
towards a situation in which relationships between closely related sequences are saved in
the form of trees (and the corresponding similarities are discarded).
<li>
When a user requests the similarities to a given PEG, first the SEED
must figure out which entry in the nonredundant protein sequence
collection corresponds to the PEG (it might actually be the
translation corresponding to the PEG, or it might be another sequence
ID corresponding to a longer sequence from, say, Swiss Prot).  Then,
the block of similarities corresponding to the representative ID is
extracted.  Each of the similarities in this set really represents a
collection of similarities (one corresponding to each ID in the set of
sequences that were collapsed into a single representative ID).  For
each of the "raw" similarities, you can choose to expand it into the
entire set it represents or not, as you wish.  The expansion actually
takes much more time than actually extracting the block of
similarities, which is an irritating fact (which may be correctable by
some bright young computer scientists, but for now...).  So, it is
important that you understand the two concepts:
<ul>
<li>
A <i>raw similarity</i> is a similarity between two members of the
nonredundant protein collection.
<li>
The act of <i>expanding a similarity</i> amounts to taking a
similarity between the given ID (ususally called <i>id1</i>) and
another (usually called <i>id2</i>) and expanding it by considering
all ids that were collapsed to <i>id2</i>.
</ul>
<li>
Finally, you need to know that when you request similarities against a
given id, you get to specify how many of the similarities you wish
expanded.  This is useful, since one might wish to see the expanded
set for, say, the closest five, while getting several hundred
unexpanded similarities.
</ol>
<br>
Here is a short program that illustrates some of these comments.  It
takes as input arguments
<ol>
<li>
a given PEG id,
<li>
a similarity threshhold (minimum p-score),
<li>
the maximum number of similarities to be returned, and
<li>
the number of similarities to expand.
</ol>
For each similarity, it displays the 
<ul>
<li> the length of the given PEG,
<li> id2 (the id of the similar sequence),
<li> the length of id2,
<li> the p-score,
<li> the perecent identity, 
<li> the region of similarity, and
<li> the function assigned to the similar sequence.
</ul>

<p>
<blockquote>
<table border="1">
<tr>
<td>show_similarities.pl</td>
<td>show_similarities.py</td>
</tr>
<tr>
<td bgcolor="f0f0f0"><pre>
use FIG;
my $fig = new FIG;
use Sim;

use strict;

my($usage,$peg,$cutoff,$max,$expand);
my(@sims,$sim,$id2,$pscore,$iden,$b1,$e1,$b2,$e2,$ln1,$ln2,$func1,$func2);

$usage = "usage: show_similarities PEG CutOff MaxReturned Expand";

(
 ($peg    = shift @ARGV) &&
 ($cutoff = shift @ARGV) &&
 ($max    = shift @ARGV) &&
 defined($expand = shift @ARGV)
)
    || die $usage;

@sims = $fig->sims($peg,$max,$cutoff,"all",$expand);
foreach $sim (@sims)
{
    $ln1    = $sim->ln1;
    $id2    = $sim->id2;
    $ln2    = $sim->ln2;
    $pscore = $sim->psc;
    $iden   = $sim->iden;
    $b1     = $sim->b1;
    $e1     = $sim->e1;
    $b2     = $sim->b2;
    $e2     = $sim->e2;
    $func2  = $fig->function_of($id2);
    print join("\t",($ln1,$id2,$ln2,$pscore,$iden,$b1,$e1,$b2,$e2,$func2)),"\n";
}
</pre></td>
<td bgcolor="f0f0f0"><pre>
from FigKernelPackages import FIG
import sys, string

fig = FIG.FIG()
id1_ix = 0
ln1_ix = 12
id2_ix = 1
ln2_ix = 13
iden_ix = 2
psc_ix = 10
ali_ln_ix = 3
mismatches_ix = 4
gaps_ix = 5
bsc_ix = 11
b1_ix = 6
e1_ix = 7
b2_ix = 8
e2_ix = 9
bit_score_ix = 11
tool_ix = 14
def2_ix = 15
ali_ix = 16


def die(msg):
	print msg
	sys.exit(0)

if len(sys.argv) != 5:
	die("usage: show_similarities PEG CutOFF MaxReturned Expand")


peg = sys.argv[1]
cutoff = sys.argv[2]
max = sys.argv[3]
expand = sys.argv[4]

sims = fig.sims(peg, max, cutoff, "all", expand)

for sim in sims:
    ln1    = str(sim[ln1_ix])
    id2    = str(sim[id2_ix])
    ln2    = str(sim[ln2_ix])
    pscore = str(sim[psc_ix])
    iden   = str(sim[iden_ix])
    b1     = str(sim[b1_ix])
    e1     = str(sim[e1_ix])
    b2     = str(sim[b2_ix])
    e2     = str(sim[e2_ix])
    func2  = fig.function_of(id2)
    print string.join((ln1, id2, ln2, pscore, iden, b1, e1, b2, e2, func2[1]), "\t")
</pre></td>
</tr>
</table>
</blockquote>

<h3>How to Access Similarities</h3>
<hr>
If youn were to run this program using
<pre>
        show_similarities 'fig|83333.1.peg.3' 1.0e-5 500 0
</pre>
it would produce an output file that started with the following lines:
<pre>
310     fig|216592.1.peg.4423   310     4e-178  99.68   1       310     1       310     Homoserine kinase (EC 2.7.1.39)
310     sp|Q8XA82       310     9e-178  99.35   1       310     1       310     Homoserine kinase (EC 2.7.1.39)
310     pirnr|NF01089494        310     2e-177  99.35   1       310     1       310     Homoserine kinase
310     pirnr|NF01137671        310     2e-177  99.35   1       310     1       310     Homoserine kinase (EC 2.7.1.39) (HK)
310     fig|216599.1.peg.3713   310     3e-177  99.35   1       310     1       310     Homoserine kinase (EC 2.7.1.39)
</pre>
This output amounts to the first five unexpanded similarities.  If you
were to run
<pre>
        show_similarities 'fig|83333.1.peg.3' 1.0e-5 500 2
</pre>
it would produce an output file that started with the following lines:
<pre>
310     sp|P00547       310     0       100     1       310     1       310     Homoserine kinase (EC 2.7.1.39)
310     pirnr|NF00702903        310     0       100     1       310     1       310     Homoserine kinase (EC 2.7.1.39) (HK)
310     kegg|eco:b0003  310     0       100     1       310     1       310     homoserine kinase (HK) [EC:2.7.1.39]
310     kegg|ecj:JW0002 310     0       100     1       310     1       310     Homoserine kinase [EC:2.7.1.39]
310     pirnr|NF01368612        310     0       100     1       310     -18     291     homoserine kinase
310     gi|33383907     310     0       100     1       310     -18     291     homoserine kinase
310     gi|16127997     310     0       100     1       310     1       310     homoserine kinase
310     fig|216592.1.peg.4423   310     4e-178  99.68   1       310     1       310     Homoserine kinase (EC 2.7.1.39)
310     fig|216593.1.peg.3578   310     4e-178  99.68   1       310     1       310     Homoserine kinase (EC 2.7.1.39)
310     fig|216598.1.peg.2077   310     4e-178  99.68   1       310     1       310     Homoserine kinase (EC 2.7.1.39)
310     gi|10186209     310     4e-178  99.68   1       310     -21     288     homoserine kinase
310     gi|33383857     310     4e-178  99.68   1       310     -18     291     homoserine kinase
310     pirnr|NF00706256        310     4e-178  99.68   1       310     -21     288     Homoserine kinase (Fragment)
310     pirnr|NF01368414        310     4e-178  99.68   1       310     -18     291     homoserine kinase
310     tr|Q9ETA5       310     4e-178  99.68   1       310     -21     288     Homoserine kinase
310     sp|Q8XA82       310     9e-178  99.35   1       310     1       310     Homoserine kinase (EC 2.7.1.39)
310     fig|155864.1.peg.3      310     9e-178  99.35   1       310     1       310     Homoserine kinase (EC 2.7.1.39)
310     kegg|ece:Z0003  310     9e-178  99.35   1       310     1       310     homoserine kinase [EC:2.7.1.39]
</pre>
The first seven lines of this expanded set of similarities amount to
identical matches against the set of sequences that were collapsed
into a single entry in the nonredundant database (i.e., sequences that
were essentially identical with <i>fig|83333.1.peg.3</i>).  The next
eight similarities correspond to the first "raw" similarity, and so
forth.
<br>
Now let us examine the program:
<ul>
<li>Use the <i>Sim.pm</i> module, if you are processing similarities.
That is, include
<pre>
        use Sim;
</pre>
at the head of your program.

<li>
The invocation
<pre>
        $fig->sims($peg,$max,$cutoff,"all",$expand)
</pre>
takes several parameters (the given PEG, the maximum number of
similarities to return, the p-score cutoff, a "selection" option, and
the number of raw similarities to expand).  
Itb returns a list of objects of type <i>similarity</i>.
The "selection" option can
be
<ul>
<li><i>raw</i> to get just raw similarities (equivalent to expanding
0)
<li><i>all</i> returns all similarities,
<li><i>fig</i> returns only similarities against FIG ids,
<li><i>external</i> returns only similarities against ids that are not
FIG ids,
</ul>
To understand thye impact of the parameters, I suggest that you take
the simple program and run it with altered parameters, or just use the
SEED web services to see what happens.
        

<li>
To extract values relating to the similarity, you would use
assignments as shown:
<pre>
    $ln1    = $sim->ln1;
    $id2    = $sim->id2;
    $ln2    = $sim->ln2;
    $pscore = $sim->psc;
    $iden   = $sim->iden;
    $b1     = $sim->b1;
    $e1     = $sim->e1;
    $b2     = $sim->b2;
    $e2     = $sim->e2;
</pre>
</ul>

There is a great deal more that could be said about similarities, but
for the purposes of this tutorial, that is it.

<h2>Functional Coupling</h2>

When I say that two genes are "functionally coupled", I mean that they
play closely related roles.  They may be multiple subunits of the same
complex, they may be enzymes from the same pathway, they may be
co-regulated, or whatever.  Functional coupling can be inferred using
many different technologies: 
<ul>
<li>detection of preserved contiguity on chromosomes,
<li>analysis of gene fusion event,
<li>detection of similar occurrence profiles, 
<li>creation of protein-protein interaction data, 
<li>detection of common transcription regulatory sites, and
<li>detection of common expression profiles
</ul>
While the SEED will eventually support all of these techniques (and,
now does support the first three), support for detection of preserved
contiguity is the only one that is built into the kernel.  
Hence, this discussion will focus on detection of functional coupling
via identification of preserved contiguity of genes.
<p>
The SEED supports one central role that can be used to summarize
preserved contiguity:
The following short program illustrates its use:

<p>
<blockquote>
<table border="1">
<tr>
<td>show_functional_coupling.pl</td>
<td>show_functional_coupling.py</td>
</tr>
<tr>
<td bgcolor="f0f0f0"><pre>
use FIG;
my $fig = new FIG;
use strict;

my($usage,$bound,$sim_cutoff,$fc_cutoff,@fc,$tuple,$peg);

$usage = "usage: show_functional_coupling Bound SimCutoff FcCutOff < PEGs > PEG1-Sc-PEGs";

(
 ($bound = shift @ARGV) &&
 ($sim_cutoff = shift @ARGV) &&
 ($fc_cutoff  = shift @ARGV)
)
    || die $usage;

while (defined($_ = <STDIN>))
{
    if ($_ =~ /^(fig\|\d+\.\d+\.peg\.\d+)/)
    {
	$peg = $1;
	@fc = $fig->coupling_and_evidence($peg,$bound,$sim_cutoff,$fc_cutoff);
	foreach $tuple (@fc)
	{
#	    print &Dumper($tuple);
	    print "$peg\t$tuple->[0]\t$tuple->[1]\n";
	}
    }
}
</pre></td>
<td bgcolor="f0f0f0"><pre>
from FigKernelPackages import FIG2
import sys, re

fig = FIG2.FIG()

def die(msg):
	print msg
	sys.exit(0)

if len(sys.argv) != 4:
	die("usage: fc Bound SimCutoff FcCutOff < PEGs > PEG1-Sc-PEGs")


bound = sys.argv[1]
sim_cutoff = sys.argv[2]
fc_cutoff = sys.argv[3]

for line in sys.stdin.readlines():	
    match =  re.search("^(fig\|\d+\.\d+\.peg\.\d+)", line)
    if match:
    	peg = match.group(1)
	fc  = fig.coupling_and_evidence(peg, bound, sim_cutoff, fc_cutoff)
	for tuple in fc:
	    print "%s\t%s\t%s" % ( peg, tuple[0], tuple[1])
</pre></td>
</tr>
</table>
</blockquote>

<h3>How to compute functional coupling scores based on preserved contiguity</h3>
<hr>
The short program simply reads in a file containing PEG ids.  For each
PEG, it invokes 
<pre>
        @fc =$fig->coupling_and_evidence($peg,$bound,$sim_cutoff,$fc_cutoff);
</pre>

to compute a list of 3-tuples.  Each 3-tuple corresponds to a
functionally coupled gene.  The 3-tuple contains
<ol>
<li> the number of occurrences of preservation of homologous pairs
that were detected in somewhat diverged genomes, 
<li> the PEG id of a functionally coupled gene, and
<li> a pointer to a list of 2-tuples that describe the pairs in
different genomes.
</ol>
If the program is run using the following command
<pre>
         show_functional_coupling 5000 1.0e-10 2 < tmp
</pre>
and if <i>tmp</i> contains a single line of the form
<pre>
fig|83333.1.peg.3
</pre>
then the program will produce the following output:
<pre>
$VAR1 = [
          77,
          'fig|83333.1.peg.2',
          [
            [
              'fig|263.1.peg.1622',
              'fig|263.1.peg.1624'
            ],
            [
              'fig|263.1.peg.1622',
              'fig|263.1.peg.1625'
            ],
            [
              'fig|562.2.peg.2720',
              'fig|562.2.peg.2721'
            ],
            [
              'fig|573.2.peg.4598',
              'fig|573.2.peg.4594'
            ],
            .
            .          <<< deleted lines showing more pairs >>>
            .
          ]
        ];
$VAR1 = [
          44,
          'fig|83333.1.peg.4',
          [
            [
              'fig|263.1.peg.1622',
              'fig|263.1.peg.1620'
            ],
            [
              'fig|562.2.peg.2526',
              'fig|562.2.peg.2527'
            ],
            [
              'fig|562.2.peg.2521',
              'fig|562.2.peg.2522'
            ],
            [
              'fig|562.2.peg.2525',
              'fig|562.2.peg.2527'
            ],
            [
              'fig|573.2.peg.4598',
              'fig|573.2.peg.4599'
            ],
            .
            .          <<< deleted lines showing more pairs >>>
            .
          ]
        ];
$VAR1 = [
          10,
          'fig|83333.1.peg.6',
          [
            [
              'fig|562.2.peg.2526',
              'fig|562.2.peg.2529'
            ],
            [
              'fig|562.2.peg.2525',
              'fig|562.2.peg.2529'
            ],
            .
            .          <<< deleted lines showing more pairs >>>
            .
          ]
        ];
$VAR1 = [
          7,
          'fig|83333.1.peg.7',
          [
            [
              'fig|562.2.peg.2526',
              'fig|562.2.peg.2531'
            ],
            [
              'fig|562.2.peg.2526',
              'fig|562.2.peg.2530'
            ],
            [
              'fig|562.2.peg.2525',
              'fig|562.2.peg.2531'
            ],
            [
              'fig|562.2.peg.2525',
              'fig|562.2.peg.2530'
            ],
            [
              'fig|573.2.peg.4598',
              'fig|573.2.peg.4604'
            ],
            .
            .          <<< deleted lines showing more pairs >>>
            .
          ]
        ];
$VAR1 = [
          7,
          'fig|83333.1.peg.8',
          [
            [
              'fig|594.1.peg.4954',
              'fig|594.1.peg.4958'
            ],
            [
              'fig|630.2.peg.581',
              'fig|630.2.peg.584'
            ],
            [
              'fig|12149.1.peg.2',
              'fig|12149.1.peg.6'
            ],
            .
            .          <<< deleted lines showing more pairs >>>
            .
          ]
        ];
</pre>
If you simply replaced
<pre>
            print &Dumper($tuple);
</pre>
with
<pre>
            print "$peg\t$tuple->[0]\t$tuple->[1]\n";
</pre>
you would get
<pre>
fig|83333.1.peg.3       77      fig|83333.1.peg.2
fig|83333.1.peg.3       44      fig|83333.1.peg.4
fig|83333.1.peg.3       10      fig|83333.1.peg.6
fig|83333.1.peg.3       7       fig|83333.1.peg.7
fig|83333.1.peg.3       7       fig|83333.1.peg.8

</pre>
which might be closer to what most end users wish to see.
<p>
The only detail that might be worth noting is the scoring function.
It counts the number of pairs (besides the pair composed of the given
PEG and the related PEG) that could be located, where "essentially
identical" occurrences are collapsed to a single occurrence.  For
example, if you had 10 <i>E.coli</i> genomes (of closely related
strains), you would wish to count them as a single occurrence.
Basically, we collapse occurrences between genes that are encode
protein sequences that are more than 97% identical to a single
occurrence.  So, when you see
<pre>
fig|83333.1.peg.3       77      fig|83333.1.peg.2
</pre>
This means that there are genes encoding proteins similar to those
encoded by fig|83333.1.peg.3 and fig|83333.1.peg.2 that are within
5000 bases of one another in 77 quite distinct genomes.

<h2>Maps, Subsystems, Functional Roles, and gene Functions</h2>

The SEED currently supports four distinct notions that I attempt to
differentiate in ths section.  My working definitions would be as
follows:
<ol>
<li>
A <b>map</b> is a relatively large set of related functional roles.
The notion grew out of working with KEGG maps.  It encompasses larger
functional units than the related notion <b>pathway</b>.
<li>The notion of <b>subsystem</b> is a slight extension of the
concept <b>pathway</b>.  It is a set of functional roles that together
play a related function.  A pathway is a metabolic subsystem.  A
ribosome is a nonmetabolic subsystem.
<p>
Since KEGG is doing such a useful job in attempting to curate maps, we
download their maps and use them.  We supplement them occasionally
with some drawn by ourselves, but we do think that the community owes
KEGG real gratitude for making these maps available.

<li>My notion of <b>functional role</b> is roughly an abstraction of
the function of a gene (i.e., a monfunctional gene).  It is a
well-defined component of a subsystem.  Variants of subsystems should
be thought of as sets of functional roles (i.e., not as sets of
genes).

<li>A <b>gene function</b> may be thought of as a set of functional
roles.  Often there is just one, but many genes are multifunctional.
Somestimes this means that a single domain of the protein encoded by
the gene has low specificity which allows the gene to play multiple
roles, and sometimes the gene has several domains with each domain
implementing a complete functional role.  Either way, if a gene
function implements multiple roles, it should have the form
<pre>
        role1 / role2 / role3...
</pre>
That is, the gene function should be made up of a list of roles
separated by " / ".
A gene function is said to <i>connect</i> to each of the roles that
make it up (those roles that are separated by " / "), as well as any
embedded EC numbers.  Thus, the gene function
<pre>
        Pyruvate kinase (EC 2.7.1.40) / 6-phosphofructokinase (EC 2.7.1.11)
</pre>
would connect to 
<ol>
<li>Pyruvate kinase (EC 2.7.1.40),
<li>6-phosphofructokinase (EC 2.7.1.11),
<li>2.7.1.40, and
<li>2.7.1.11
</ol>
</ol>
<p>
Normally, when we speak of maps, most of the functional roles are
simply EC numbers (although some are not).  When we speak of
subsystems, the functional roles are almost never specified as just an
EC number.
<p>
Our first simple example program takes as command line arguments a map
ID (i.e., one of the KEGG map numbers -- these can be taken off the
initial page displayed by the SEED web services) and a genome ID.  It
displays the map ID and the "map name".  Then, it lists the roles
contained within the map, and for each role it displays the PEGs
associated with the role.  Thus, 

<pre>
        connect_map_to_genes MAP00010 83333.1
</pre>
produced the following output for me:
<pre>
MAP00010        Glycolysis / Gluconeogenesis
  2.7.1.11
    fig|83333.1.peg.1706        6-phosphofructokinase II (EC 2.7.1.11)
    fig|83333.1.peg.3835        6-phosphofructokinase (EC 2.7.1.11)
  2.7.1.41
  1.2.1.3
    fig|83333.1.peg.3522        Aldehyde dehydrogenase (EC 1.2.1.3)
  5.4.2.1
    fig|83333.1.peg.3548        2,3-bisphosphoglycerate-independent phosphoglycerate mutase (EC 5.4.2.1)
    fig|83333.1.peg.4303        Phosphoglycerate mutase (EC 5.4.2.1)
    fig|83333.1.peg.741 Phosphoglycerate mutase (EC 5.4.2.1)
  3.1.3.10
    fig|83333.1.peg.989 Glucose-1-phosphatase precursor (EC 3.1.3.10)
  1.1.1.27
    fig|83333.1.peg.787 L-lactate dehydrogenase (EC 1.1.1.27)
  1.8.1.4
    fig|83333.1.peg.116 Dihydrolipoamide dehydrogenase (EC 1.8.1.4)
  3.6.1.7
    fig|83333.1.peg.953 Acylphosphatase (EC 3.6.1.7)
  2.7.1.40
    fig|83333.1.peg.1660        Pyruvate kinase (EC 2.7.1.40)
    fig|83333.1.peg.1838        Pyruvate kinase (EC 2.7.1.40)
  2.7.1.1
  3.1.3.9
  4.1.1.1
  5.1.3.15
  1.1.99.8
  1.2.4.1
    fig|83333.1.peg.114 Pyruvate dehydrogenase E1 component (EC 1.2.4.1)
  5.1.3.3
    fig|83333.1.peg.742 Aldose 1-epimerase (EC 5.1.3.3)
  6.2.1.1
    fig|83333.1.peg.3979        Acetyl-coenzyme A synthetase (EC 6.2.1.1)
  5.4.2.4
  2.7.1.69
    fig|83333.1.peg.1086        PTS system, glucose-specific IIBC component (EIIBC-Glc) (Glucose- permease IIBC component) (Phosphotransferase enzyme II, BC component) (EC 2.7.1.69)
    fig|83333.1.peg.1607        PTS system, maltose and glucose-specific IIABC component (Maltose and glucose-permease IIABC component) (Phosphotransferase enzyme II, ABC component) (EC 2.7.1.69)
    fig|83333.1.peg.1719        PTS system, cellobiose-specific IIA component (EIIA-Cel) (Cellobiose- permease IIA component) (Phosphotransferase enzyme II, A component) (EC 2.7.1.69)
    fig|83333.1.peg.1721        PTS system, cellobiose-specific IIB component (EIIB-Cel) (Cellobiose- permease IIB component) (Phosphotransferase enzyme II, B component) (EC 2.7.1.69)
    fig|83333.1.peg.1800        PTS system, mannose-specific IIAB component (EC 2.7.1.69)
    fig|83333.1.peg.2068        PTS system, galactitol-specific IIB component (EIIB-GAT) (Galacticol- permease IIB component) (Phosphotransferase enzyme II, B component) (EC 2.7.1.69)
    fig|83333.1.peg.2069        PTS system, galactitol-specific IIA component (EIIA-GAT) (Galacticol- permease IIA component) (Phosphotransferase enzyme II, A component) (EC 2.7.1.69)
    fig|83333.1.peg.2142        PTS system, fructose-specific IIBC component (EIIBC-Fru) (Fructose- permease IIBC component) (Phosphotransferase enzyme II, BC component) (EC 2.7.1.69)
    fig|83333.1.peg.2144        PTS system, fructose-specific IIA/FPr component (EIIA-Fru) (Fructose- permease IIA/FPr component) (Phosphotransferase enzyme II, A/FPr component) (EC 2.7.1.69)
    fig|83333.1.peg.2165        PTS system, mannitol-specific IIBC component (EC 2.7.1.69)
    fig|83333.1.peg.2360        PTS system, fructose-specific IIBC component (EC 2.7.1.69)
    fig|83333.1.peg.2385        PTS system, glucose-specific IIA component (EC 2.7.1.69)
    fig|83333.1.peg.2657        Sorbitol PTS, EIIC (EC 2.7.1.69)
    fig|83333.1.peg.2658        PTS system, glucitol/sorbitol-specific IIBC component (EC 2.7.1.69)
    fig|83333.1.peg.2659        PTS system, glucitol/sorbitol-specific IIA component (EIIA-gut) (Glucitol/sorbitol-permease IIA component) (Phosphotransferase enzyme II, A component) (EC 2.7.1.69)
    fig|83333.1.peg.2670        PTS system, arbutin-, cellobiose-, and salicin-specific IIABC component (EIIABC-ASC) (Arbutin-, cellobiose-, and salicin-permease IIABC component) (Phosphotransferase enzyme II, ABC component) (EC 2.7.1.69)
    fig|83333.1.peg.2884        PTS system, mannitol (Cryptic) -specific IIBC component (EIIBC-(C)MTL) (Mannitol (Cryptic)-permease IIBC component) (Phosphotransferase enzyme II, BC component) (EC 2.7.1.69)
    fig|83333.1.peg.2885        PTS system, mannitol (Cryptic) -specific IIA component (EIIA-(C)MTL) (Mannitol (Cryptic)-permease IIA component) (Phosphotransferase enzyme II, A component) (EC 2.7.1.69)
    fig|83333.1.peg.3083        PTS system, N-acetylgalactosamine-specific IIB component 1 (EIIB-AGA) (N-acetylgalactosamine-permease IIB component 1) (Phosphotransferase enzyme II, B component 1) (EC 2.7.1.69)
    fig|83333.1.peg.3147        Nitrogen regulatory IIA protein (EC 2.7.1.69)
    fig|83333.1.peg.3534        PTS system, mannitol-specific IIABC component (EIIABC-MTL) (Mannitol- permease IIABC component) (Phosphotransferase enzyme II, ABC component) (EC 2.7.1.69)
    fig|83333.1.peg.3659        PTS system, beta-glucoside-specific IIABC component (EIIABC-Bgl) (Beta-glucoside-permease IIABC component) (Phosphotransferase enzyme II, ABC component) (EC 2.7.1.69)
    fig|83333.1.peg.3820        PTS system, fructose-specific IIBC component (EC 2.7.1.69)
    fig|83333.1.peg.3821        PTS system, fructose-specific IIABC component (EC 2.7.1.69)
    fig|83333.1.peg.3869        PTS system, fructose-specific IIBC component (EC 2.7.1.69)
    fig|83333.1.peg.3870        PTS system, fructose-like-2 IIB component 1 (Phosphotransferase enzyme II, B component) (EC 2.7.1.69)
    fig|83333.1.peg.3873        PTS system, fructose-like-2 IIB component 2 (Phosphotransferase enzyme II, B component) (EC 2.7.1.69)
    fig|83333.1.peg.4150        PTS system, trehalose-specific IIBC component (EIIBC-TRE) (Trehalose- permease IIBC component) (Phosphotransferase enzyme II, BC component) (EC 2.7.1.69)
    fig|83333.1.peg.4213        PTS system, galactitol-specific IIC component (EC 2.7.1.69)
    fig|83333.1.peg.669 PTS system, glucose-specific IIABC component (EC 2.7.1.69)
    fig|83333.1.peg.723 PTS system, fructose-specific IIABC component (EC 2.7.1.69)
  1.1.1.71
  1.1.1.1
    fig|83333.1.peg.1228        Aldehyde-alcohol dehydrogenase [Includes: Alcohol dehydrogenase (EC 1.1.1.1) (ADH); Acetaldehyde dehydrogenase [acetylating] (EC 1.2.1.10) (ACDH); Pyruvate-formate-lyase deactivase (PFL deactivase)]
    fig|83333.1.peg.1301        Alcohol dehydrogenase (EC 1.1.1.1)
    fig|83333.1.peg.3523        Alcohol dehydrogenase (EC 1.1.1.1)
    fig|83333.1.peg.353 Alcohol dehydrogenase class III (EC 1.1.1.1)
  2.7.1.63
  3.2.1.86
    fig|83333.1.peg.1717        6-phospho-beta-glucosidase (EC 3.2.1.86)
    fig|83333.1.peg.2853        6-phospho-beta-glucosidase bglA (EC 3.2.1.86)
    fig|83333.1.peg.3658        6-phospho-beta-glucosidase bglB (EC 3.2.1.86)
  3.1.6.3
  1.2.1.12
    fig|83333.1.peg.1762        Glyceraldehyde 3-phosphate dehydrogenase (EC 1.2.1.12)
  2.7.2.3
    fig|83333.1.peg.2877        Phosphoglycerate kinase (EC 2.7.2.3)
  5.3.1.9
    fig|83333.1.peg.3934        Glucose-6-phosphate isomerase (EC 5.3.1.9)
  1.2.1.5
  1.2.1.51
  4.2.1.11
    fig|83333.1.peg.2735        Enolase (EC 4.2.1.11)
  5.3.1.1
    fig|83333.1.peg.3838        Triosephosphate isomerase (EC 5.3.1.1)
  3.1.3.13
  4.1.2.13
    fig|83333.1.peg.1504        Fructose-bisphosphate aldolase (EC 4.1.2.13)
    fig|83333.1.peg.2072        Fructose-bisphosphate aldolase (EC 4.1.2.13)
    fig|83333.1.peg.2876        Fructose-bisphosphate aldolase (EC 4.1.2.13)
  2.7.1.2
    fig|83333.1.peg.2361        Glucokinase (EC 2.7.1.2)
  2.3.1.12
    fig|83333.1.peg.115 Dihydrolipoamide acetyltransferase component of pyruvate dehydrogenase complex (EC 2.3.1.12) (E2)
  1.1.1.2
  3.1.3.11
    fig|83333.1.peg.2881        Fructose-1,6-bisphosphatase (EC 3.1.3.11)
    fig|83333.1.peg.4142        Fructose-1,6-bisphosphatase (EC 3.1.3.11)
  5.4.2.2
    fig|83333.1.peg.678 Phosphoglucomutase (EC 5.4.2.2)
</pre>

Now let's examine the short program that produced the output:

<p>
<blockquote>
<table border="1">
<tr>
<td>connect_map_to_genes.pl</td>
<td>connect_map_to_genes.py</td>
</tr>
<tr>
<td bgcolor="f0f0f0"><pre>
use FIG;
my $fig = new FIG;
use Sim;

use strict;
my($usage,$map,$genome,$name,$role,$peg,$func,$expanded_role);

$usage = "usage: connect_map_to_genes Map Genome";

(
 ($map    = shift @ARGV) &&
 ($genome = shift @ARGV)
)
    || die $usage;

$name = $fig->map_name($map);
print "$map\t$name\n";
foreach $role ($fig->map_to_ecs($map))
{
    print "  $role\n";
    foreach $peg ($fig->seqs_with_role($role,"master",$genome))
    {
	$func = $fig->function_of($peg);
	print "    $peg\t$func\n";
    }
}
</pre></td>
<td bgcolor="f0f0f0"><pre>
from FigKernelPackages import FIG2
import sys

fig = FIG2.FIG()

def die(msg):
	print msg
	sys.exit(0)

if len(sys.argv) != 3:
	die("usage: connect_map_to_genes Map Genome")

map = sys.argv[1]
genome = sys.argv[2]
name = fig.map_name(map);

print "%s\t%s" % (map, name)
roles = fig.map_to_ecs(map)
for role in roles:
    print role
    pegs = fig.seqs_with_role(role,"master",genome)
    for peg in pegs:
	func = fig.function_of(peg);
	print "    %s\t%s" % (peg, func[0][1])
</pre></td>
</tr>
</table>
</blockquote>

returns the full name associated with the map.
<li>You can get the the list of functional roles contained within a map using
<pre>
        $fig->map_to_roles($map)
</pre>
<li>Because the roles within a map are often just EC numbers, you will
usually wish to "expand" them by attaching a common name for the
function represented by the EC number.  You can do this using
<pre>
        $expanded_role = $fig->expand_ec($role)
</pre>
I did not do that in this example, since I wanted you to see the EC
numbers being used as functional roles in maps.  You should modify the
program to expand the ECs and rerun it.
<li>
Finally, you can get a list of the PEGs within a genome that have gene functions
which connect to a specific functional role using
<pre>
        $fig->seqs_with_role($role,"master",$genome)
</pre>
</ul>
Since maps and subsystems are remarkably similar notions (both are
sets of functional roles), you could also construct a similar program
for subsystems.  In the SEED, subsystems do not have IDs which are
associated with full names; rather, the full name is itself the ID of
the subsystem.  With this in mind, the following program should be
reasonably clear:

<p>
<blockquote>
<table border="1">
<tr>
<td>connect_subsystem_to_genes.pl</td>
<td>connect_subsystem_to_genes.py</td>
</tr>
<tr>
<td bgcolor="f0f0f0"><pre>

use FIG;
my $fig = new FIG;
use Sim;

use strict;
my($usage,$subsystem,$genome,$name,$role,$peg,$func);

$usage = "usage: connect_subsystem_to_genes Subsystem Genome";

(
 ($subsystem    = shift @ARGV) &&
 ($genome       = shift @ARGV)
)
    || die $usage;

print "$subsystem\n";
foreach $role ($fig->subsystem_to_roles($subsystem))
{
    print "  $role\n";
    foreach $peg ($fig->seqs_with_role($role,"master",$genome))
    {
	$func = $fig->function_of($peg);
	print "    $peg\t$func\n";
    }
}
</pre></td>
<td bgcolor="f0f0f0"><pre>
from FigKernelPackages import FIG2
import sys

fig = FIG2.FIG()

def die(msg):
	print msg
	sys.exit(0)

if len(sys.argv) != 3:
	die("usage: connect_subsystem_to_genes Subsystem Genome")

subsystem  = sys.argv[1]
genome = sys.argv[2]

print subsystem
roles = fig.subsystem_to_roles(subsystem)
for role in roles:
    print role
    pegs = fig.seqs_with_role(role,"master",genome)
    for peg in pegs:
	func = fig.function_of(peg);
	print "%s\t%s" % ( peg, func[0][1])
</pre></td>
</tr>
</table>
</blockquote>

<h3>Locating genes within an organism that connect to a specific subsystem</h3>
<hr>
Invoking this program using
<pre>
        connect_subsystem_to_genes 'Embden-Meyerhof and Gluconogenesis' 83333.1
</pre>
would produce the following output:
<pre>
</pre>
Note that the functional roles in subsystems tend to be far more
detailed (and almost never just an EC number).
<p>
The problem with the above program is that we have chosen to view a
subsystem as a spreadsheet in which each cell corresponds to a
specific functional role in a designated organism.  The gene functions
of the PEGs contained in the cell usually connect to the specific
role, but this is not forced.  If someone changes the function of a
gene, it does not disappear from the spreadsheet, unless the
spreadsheet is rebuilt by forcing the entries to reflect current
assignments and how they connect to the roles.  We have chosen this
approach to allow curators of spreadsheets to have precise control
over the contents.  People who import a subsystem may or may not
instal the gene assignments needed to make the spreadsheet
consistent, but the spreadsheet reflects the curators best assessment
of which genes implement specific roles.  This leads to my assertion
that the following program is a better way to display a subsystem:

<p>
<blockquote>
<table border="1">
<tr>
<td>connect_subsystem_to_genes_the_right_way.pl</td>
<td>connect_subsystem_to_genes_the_right_way.py</td>
</tr>
<tr>
<td bgcolor="f0f0f0"><pre>

use FIG;
my $fig = new FIG;
use Sim;

use strict;
my($usage,$subsystem,$genome,$name,$role,$peg,$func);

$usage = "usage: connect_subsystem_to_genes_the_right_way Subsystem Genome";

(
  ($subsystem    = shift @ARGV) &&
  ($genome       = shift @ARGV)
 )
    || die $usage;

print "$subsystem\n";
foreach $role ($fig->subsystem_to_roles($subsystem))
{
    print "  $role\n";
    foreach $peg ($fig->pegs_in_subsystem_cell($subsystem, $genome, $role))
    {
	$func = $fig->function_of($peg);
	print "    $peg\t$func\n";
    }
}
</pre></td>
<td bgcolor="f0f0f0"><pre>
from FigKernelPackages import FIG2
import sys

fig = FIG2.FIG()

def die(msg):
	print msg
	sys.exit(0)

if len(sys.argv) != 3:
	die("usage: connect_subsystem_to_genes_the_right_way Subsystem Genome")

subsystem = sys.argv[1]
genome  = sys.argv[2]

print subsystem
roles = fig.subsystem_to_roles(subsystem)

for role in roles:
    print role
    pegs = fig.pegs_in_subsystem_cell(subsystem, genome, role)
    for peg in pegs: 
	func = fig.function_of(peg)
	print "    %s\t%s" % (peg, func[0][1])
</pre></td>
</tr>
</table>
</blockquote>

<h3>Locating genes within an organism that connect to a specific
subsystem the right way</h3>
<hr>

<p>
We have just illustrated how to connect maps to functional roles and
functional roles to gene functions.  Now we need to go the other way.
That is, given a gene function, what roles would it 
connect to, and which maps and subsystems contain those roles?
The following program illustrates the services required to answer this
question;

<p>
<blockquote>
<table border="1">
<tr>
<td>gf2maps_and_subs.pl</td>
<td>gf2maps_and_subs.py</td>
</tr>
<tr>
<td bgcolor="f0f0f0"><pre>
use FIG;
my $fig = new FIG;

use strict;
my($usage,$peg,$func,@roles,$role,@maps,$map);
my(%in_map,@maps_it_connects_to,@subsystems_it_is_in);
my($map_name,$subsystem);

$usage = "usage: gf2maps_and_subs PEG";

# The single input argument must be a FIG id, which is used to get the assigned function.
# The program then prints a list of maps to which the gene can be connected,
# followed by a list of subsystems that contain the gene.

($peg = shift @ARGV)
    || die $usage;

if (($func = $fig->function_of($peg)) && (! &FIG::hypo($func)))
{
    @roles = $fig->roles_of_function($func);
    foreach $role (@roles)
    {
	@maps = $fig->role_to_maps($role);
	foreach $map (@maps)
	{
	    $in_map{$map} = 1;
	}
    }

    @maps_it_connects_to       = sort keys(%in_map);

    if (@maps_it_connects_to > 0)
    {
	print "Maps that connect to $func\n\n";
	foreach $map (@maps_it_connects_to)
	{
	    $map_name = $fig->map_name($map);
	    print "$map\t$map_name\n";
	}
	print "\n";
    }

}
else
{
    print STDERR "you need to give a valid PEG with a non-hypothetical function to connect to maps\n";
}

@subsystems_it_is_in = $fig->peg_to_subsystems($peg);

if (@subsystems_it_is_in > 0)
{
    print "Subsystems that contain $peg\n\n";
    foreach $subsystem (@subsystems_it_is_in)
    {
	print "$subsystem\n";
    }
}
</pre></td>
<td bgcolor="f0f0f0"><pre>
from FigKernelPackages import FIG2
import sys

fig = FIG2.FIG()

def die(msg):
	print msg
	sys.exit(0)

if len(sys.argv) != 2:
	die("usage: gf2maps_and_subs PEG")

# The single input argument must be a FIG id, which is used to get the assigned function.
# The program then prints a list of maps to which the gene can be connected,
# followed by a list of subsystems that contain the gene.

peg = sys.argv[1]

func = fig.function_of(peg)
func = func[0][1]
hypo = fig.hypo(func)
in_map = {}

if func and hypo[0] == 0: 
    roles = fig.roles_of_function(func)
    for role in roles:
	maps = fig.role_to_maps(role)
	for map in maps:
	    in_map[map] = 1

    maps_it_connects_to = in_map.keys()
    maps_it_connects_to.sort()

    if len(maps_it_connects_to) > 0:
	print "Maps that connect to %s\n" % func
	for map in maps_it_connects_to:
	    map_name = fig.map_name(map)
	    print "%s\t%s" % (map, map_name[0])
	print " ";
else:
    sys.stderr.write("you need to give a valid PEG with a non-hypothetical function to connect to maps\n")

subsystems_it_is_in = fig.peg_to_subsystems(peg)

if len(subsystems_it_is_in) > 0:
    print "Subsystems that contain %s\n" % peg
    for subsystem in subsystems_it_is_in:
	print subsystem
</pre></td>
</tr>
</table>
</blockquote>

<h3>Going from a PEG to the maps and subsystems containing it</h3>
<hr>
Here, again, there are some of points worth discussing:
<ul>
<li> First, note that the input argument is a PEG, which we use to get
the function that we will work with.  You learned how to get the
function in a previous example, but note the use of
<pre>
        &FIG::hypo($func)
</pre>
to check to see whether or not the function is equivalent to
<i>hypothetical protein</i>. 
<p>
The way we explore connections between the PEG and maps is quite
different than the way we connect the PEG to subsystems.  In the case
of maps, the connection is constructed by computing the set of roles
to which the gene function connects, and then connecting each role to
maps.  In the case of subsystems, we simply look up which subsystems
contain the given PEG.  Subsystems explicitly contain a precise list
of the PEGs that occur within them, which is not true of maps.

<li>The key services for getting at the connections are as follows:
<pre>
    @roles      = $fig->roles_of_function($func);
    @maps       = $fig->role_to_maps($role)
    @subsystems = $fig->peg_to_subsystems($peg);
</pre>
The first takes a function as input and returns a list of roles that
the function connects to, the second takes a role and returns a list
of maps to which the role connects, and the last takes a peg and
returns the list of subsystems that contain the PEG.
</ul>
If you take a few minutes out and think about maps and subsystems, you
will see that they are very, very similar concepts that are treated
quite differently by the SEED.

<h2>Annotations</h2>

Annotations are time-stamped text notes made by a named user and
attached to specific features.  The most common are notes recording
when someone made an assignment of function to a PEG, but they can be
used to record anything.  The usual way to access the annotations
attached to a given feature with an ID designated by <i>$fid</i> is to
use
<pre>
        @annotations = $fig->feature_annotations($fid)
</pre>
The returned list contains 4-tuples (i.e., pointers to lists, each of
which contains 4 entries).  Each 4-tuple contains
<ol>
<li> the FID (this is redundant, but it was convenient for reasons
that will become apparent),
<li>a timestamp,
<li>the name of the user that made the annotation, and
<li>the annotation in the form of a text string.
</ol>
The following short program uses 
<pre>
        @annotations = $fig->merged_related_annotations($fig_ids)
</pre>
which is a useful generalization.  This routine takes a pointer to a list of
feature IDs as input, and it produces a merged list of the annotations
attached to all of the features.  This can often be very convenient.
For example, you might have a set of similar PEGs that are all
misannotated, and you wish to reconstruct the steps that led to the
current situation.  By merging the annotations (which normally record
both who made the annotations and why), you can get a record of the
events in chronilogical order.  The following short example
illustrates this utility:

<p>
<blockquote>
<table border="1">
<tr>
<td>annotations_of.pl</td>
<td>annotations_of.py</td>
</tr>
<tr>
<td bgcolor="f0f0f0"><pre>
use FIG;
my $fig = new FIG;
use strict;

my($usage,$fig_ids,@annotations);

$usage = "usage: annotations_of Fid1 Fid2 ...";

(@ARGV > 0)
    || die $usage;

$fig_ids = [@ARGV];

@annotations = $fig->merged_related_annotations($fig_ids);
if (@annotations > 0)
{
    &display_annotations(\@annotations);
}
else
{
    print STDERR "Sorry, there were no annotations attached to the features\n";
}

sub display_annotations {
    my($annotations) = @_;
    my($annotation,$fid,$timestamp,$user,$text);

    foreach $annotation (@$annotations)
    {
	($fid,$timestamp,$user,$text) = @$annotation;
	print "=======\n";
	print "$user\t$timestamp\t$fid\n$text\n";
	print "=======\n\n";
    }
}
</pre></td>
<td bgcolor="f0f0f0"><pre>
from FigKernelPackages import FIG
import sys

def display_annotations(annotations):
    for annotation in annotations:
	fid,timestamp,user,text = annotation
	print "=======\n";
	print "%s\t%s\t%s\n%s\n" % ( user, timestamp, fid, text)
	print "=======\n\n"

fig = FIG.FIG()

def die(msg):
	print msg
	sys.exit(0)

if len(sys.argv) < 2:
	die("usage: annotations_of Fid1, Fid2, ... ")

fig_ids = sys.argv[1:]

annotations = fig.merged_related_annotations(fig_ids)
if len(annotations) > 0:
    display_annotations(annotations)
else:
    sys.stderr.write("Sorry, there were no annotations attached to the features\n")

</pre></td>
</tr>
</table>
</blockquote>


<h3>Accessing the annotations from a set of features</h3>
<hr>
Invoking this little utility using
<pre>
        annotations_of 'fig|207559.1.peg.735' 'fig|12149.1.peg.2133' 'fig|1525.1.peg.1757' 
</pre>
produced the following output:
<pre>
=======
automated_assignment    Thu Mar 18 17:42:51 2004        fig|12149.1.peg.2133
Set master function to
ATP phosphoribosyltransferase (EC 2.4.2.1
=======

=======
automated_assignment    Thu Mar 18 17:45:16 2004        fig|1525.1.peg.1757
Set master function to
ATP phosphoribosyltransferase (EC 2.4.2.1
=======

=======
last_release    Fri Apr  2 14:14:07 2004        fig|207559.1.peg.735
Set master function to
ATP phosphoribosyltransferase (EC 2.4.2.1
=======
</pre>

<h2>Assigning Gene Functions</h2>

We can now easily talk about assigning functions and making
annotations to record the events.  Here is a simple utility
that can be used to make annotations:

<p>
<blockquote>
<table border="1">
<tr>
<td>batch_assignments.pl</td>
</tr>
<tr>
<td bgcolor="f0f0f0"><pre>
use FIG;
my $fig = new FIG;

use strict;
my($usage,$user,$annotation,$text,$peg,$func,$conf,$user_name,$funcO);

$usage = "usage: batch_assignments User [AnnotationFile] < Assignments";

($user = shift @ARGV)
    || die $usage;

if ($annotation = shift @ARGV)
{
    $text = join("",`cat $annotation`);
}
else
{
    $text = "";
}

while (defined($_ = <STDIN>))
{
    chop;
    ($peg,$func,$conf) = split(/\t/,$_);
    if (! $conf) { $conf = "" }

    $funcO = $fig->function_of($peg);
    if ($funcO ne $func)
    {
        #  annotations are now made by assign_function
	$fig->assign_function($peg,$user,$func,$conf);
    }
}
    
</pre></td>
</tr>
</table>
</blockquote>

<h3>A utility to make batch assignments</h3>
<hr>
Before discussing the details of this code, we should discuss the
issue of "master and non-master users".  Each gene can have a single
<i>master assignment</i> and any number of <i>non-master
assignments</i>.  A user who has signed in as <i>master:USER</i>
(e.g., "master:RossO") is a master user.  If a master user makes an
assignment, it overwrites the master assignment.  If a non-master user
makes an assignment, it alters only the assignment corresponding to
that user, if there is one.  When a master asks for the function of a
gene, he gets the master assignment.  If a non-master asks for the
function of a gene, he will get the assignment corresponding to his
user id, if there is one; if not, he will get the master assignment,
if there is one.
<p>
There is also the matter of <i>confidence levels</i>.  The SEED
supports attaching confidence codes to assignments, although they are
seldom used.  The codes we use at this point are:
<ul>
<li> an empty string, a space or <b>N</B> means "normal or average confidence",
<li> an <b>H</b> or an <b>S</B> means "high or strong confidence", and
<li>an <b>L</b> or a <b>W</b> means "low or weak confidence".
</ol>
We are not really employing the codes actively.  The major source of
strength in an assignment is being part of a subsystem or a
well-characterized class of proteins.  This will clarify substantially
as the Project to Annotate 1000 Genomes progresses.
<p>
Now, back to the example program:
<ul>
<li>
  The program is written to allow the
user to optionally attach a block of annotation to each of the genes
that get reassigned.  If the optional argument (the second argument)
is prtesent, it is presumed to be the name of a file containing text
to be appended to the annotation the program makes to record the
change of assignment. The line of code
<pre>
    $text = join("",`cat $annotation`);
</pre>
just reads the entire file, concatenates the lines, and assigns the
result to the variable <i>$text</i>.
<li>
We default the <i>confidence level</i> to the empty string.
<li> We only make asssignments if the current function differs from
the one to be assigned.  <i>$funcO</i> is used to hold the existing
asssignment.
<li>
The 
<pre>
	    $fig->assign_function($peg,$user,$func,$conf);
</pre>
is used to actually make the assignment in the database, record it in in the
current assignments file, and record an annotation of the annotation history
file.
</ul>

<h2>BBHs</h2>

Now, let us return briefly to the topic of similarities.  The notion
of <i>bi-directional best hit</i> (<b>BBH</b>) often pops up in
discussions relating to similarities and how to project assignments
from a gene in one genome to a gene in another.  A gene <i>X</i> in
genome <i>Gx</i> is a BBH of a gene <i>Y</i> in <i>Gy</i> iff 
<ol>
<li><i>X</i> is similar to <i>Y</i>,
there is no other gene in <i>Gy</i> more similar to <i>X</i>, and
there is no other gene in <i>Gx</i> more similar to <i>Y</i>.
</ol>
To get the BBHs for a given PEG, you would use the something like the
following:
<pre>
	@bbhs = $fig->bbhs($peg,1.0e-10);
</pre>
This would set <i>@bbhs</i> to a list of 2-tuples.  Each 2-tuple would
contain a PEG and a p-score.  They would correspond to BBHs which have
similarity better than 1.0e-10.

<h2>The Use of dsims</h2>

We have constructed a means of computing <i>dynamic similarities</i>
fairly rapidly.  The basic idea involves breaking the nonredundant
protein database into many thousands of subsets, where all of the
sequences in a subset are quite similar (sequences that do not fall
into subsets get placed into a <i>nohits database</i>).  Then, one or more
representative sequences from each subset are included in a database
of <i>exemplars</i>.  
<p>
To get <i>blastp</i> similarities quickly (and somewhat less
accurately than just blasting against the nonredundant database), use
something like
<pre>
	@sims = $fig->dsims($id,$seq,200,1.0e-10,"raw")
</pre>
This would return the similarities that can rapidly be detected
between the protein sequence with id given by <i>id</i> and sequence
given by <i>$seq</i>.  It would give you at most 200 similarities
with p-scores better than 1.0e-10, and you would get only the
<i>raw</i> similarities (<i>all</i> would return the expanded
similarities).

<h1> PAUSE: I am pausing here; I hope to finish this tutorial within a few months</h1>

<h2>Clusters, Pins, and Largest Clusters</h2>

<h2>The Indexes</h2>

<h2>Administrative Issues</h2>

<h2>Atomated Assignment of Function</h2>

<h2>Finding Candidates for a Functional Role</h2>

<h2>Protein Families</h2>

<h2>Compounds and Reactions</h2>

<h2>Subsystems</h2>

<h2>Links to PEGs</h2>

<h2>Miscellaneous Stuff that You Should Know</h2>

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3