Accessing SEED Services

Introduction

This document is designed to help a programmer understand how to access the services provided by the SEED. It is not designed for the user who wishes to confine his experience to working with the SEED browser over the Web.

Before we launch into a discussion of the details of how to invoke specific routines, there are some background issues that you should know about. First, let's talk about fig, which is a command-line utility that is used for a number of purposes. Its most basic purpose is to exercise the API and illustrate what data is accessed by each operation. However, you may also find it extremely useful as a quick way to invoke services. To try it out, type

        fig
This should produce a prompt of the form
        ??
You should type
        ?? h<cr>
to get a listing of the commands supported by the utility. At the time this document was written (May, 2004), the output was as follows:
?? h
    DB                              Current DB
    abbrev                          genome_name
    access_sims                     [expand|func|expandfunc]       (for timeing - access 10 distinct sims)
    add_annotation                  FID User [ prompted for annotation; terminate with "." at start of line ]
    add_chromosomal_clusters        File
    add_genome                      GenomeDir
    add_pch_pins                    File
    add_peg_link                    PEG Link (e.g., <a href=http//...>link to somewhere</a>)
    all_compounds               
    all_features                    GenomeID Type
    all_maps                 
    all_protein_families
    all_reactions
    all_roles
    all_sets                        Relation SetName
    assign_function                 PEG User [conf=X] Function
    assign_functionF                User File [no_annotations]
    assignments_made                who date G1 G2 ...  [ none => all ]
    auto_assign                     PEG [Seq]
    auto_assignF                    User FileOfPEGs
    auto_assignG                    User GenomeID1 GenomeID2 ...
    bbhs                            PEG Cutoff
    between                         n1 n2 n3
    blast                           PEG  [against nr]
    blastit                         PEG seq db maxP
    boundaries_of                   Loc
    build_tree_of_complete          min_for_label   
    by_alias                        alias
    by_fig_id                       FID1 FID2
    candidates_for_role             Genome Cutoff Role
    cas                             CID
    cas_to_cid                      cas
    catalyzed_by                    RID
    catalyzes                       role
    cgi_url
    clean_tmp
    close_genes                     FID distance
    comp2react                      CID
    contig_ln                       GenomeID Contig
    coupling_and_evidence           FID Bound SimCutoff CouplingCutoff
    crude_estimate_of_distance      GenomeID1 GenomeID2
    delete_all_peg_links            PEG
    delete_genomes                  G1 G2 G3 ...Gn
    delete_peg_link                 PEG URL
    displayable_reaction            RID
    dna_seq                         GenomeID Loc
    dsims                           FastaFile [MaxN MaxPsc Select]   *** select defaults to raw
    ec_to_maps                      EC
    ec_name                         EC
    expand_ec                       EC
    epoch_to_readable               [gives readable version of current time]
    export_chromosomal_clusters
    export_pch_pins
    export_set                      Relation SetName File
    exportable_subsystem            Subsystem
    external_calls                  PEG1 PEG2 PEG3...
    extract_seq                     ContigsFile loc 
    all_exchangable_subsystems
    best_bbh_candidates             Genome Cutoff Requested Known
    family_function                 Family
    fast_coupling                   FID Bound CouplingCutoff
    feature_aliases                 FID
    feature_annotations             FID
    feature_location                FID 
    file2N                          File
    ftype                           FID 
    function_of                     ID [all] [trans]
    genes_in_region                 GenomeID Contig Beg End 
    genome_of                       PEG
    genomes                         [complete|Pat]
    genome_counts                   [complete]
    genome_version                  GenomeID
    genus_species                   GenomeID
    get_translation                 ID
    get_translations                File [tab]
    h                               [command]  *** h h for help on how to use help ***
    hypo                            Function
    ids_in_family                   Family
    ids_in_set                      WhichSet Relation SetName
    in_cluster_with                 PEG
    in_family                       FID
    in_pch_pin_with                 FID
    in_sets                         Which Relation SetName
    is_archaeal                     GenomeID
    is_bacterial                    GenomeID
    is_eukaryotic                   GenomeID
    is_prokaryotic                  GenomeID
    is_exchangable_subsystem        Subsystem
    is_real_feature                 FID   
    largest_clusters                FileOfRoles user 
    load_all
    map_to_ecs                      map
    map_name                        map
    mapped_prot_ids                 ID
    maps_to_id                      ID 
    max                             n1 n2 n3 ...
    merged_related_annotations      PEG1 PEG2 ... PEGn
    min                             n1 n2 n3 ...
    names_of_compound
    neighborhood_of_role            role        
    org_of                          prot_id
    pegs_of                         GenomeID
    possibly_truncated              FID
    reaction2comp                   RID
    related_by_func_sim             PEG user
    reversible                      RID  
    rnas_of                         GenomeID
    roles_of_function               function
    same_func                       'function1'   'function2'
    search_index                    pattern
    seqs_with_role                  role who
    seqs_with_roles_in_genomes      genomes roles made_by
    sims                            PEG maxN maxP select
    sort_fids_by_taxonomy           FID1 FID2 FID3 ...
    sort_genomes_by_taxonomy        GenomeID1 GenomeID2 GenomeID3 ...
    sz_family                       family
    taxonomic_groups_of_complete    min_for_labels
    taxonomy_of                     GenomeID
    translatable                    prot_id
    translate_function              PEG user 
    translated_function_of          PEG user
    translation_length              ID
    unique_functions                PEGs user  [make the pegs comma separated]
    verify_dir                      dir
    
If you quickly browse through the commands that are supported by fig, you should get a fairly complete overview of the services provided by the API. We are not going to cover the use of fig in detail in this tutorial, but we do encourage you to experiment with it. By the way, you should type x to exit from it.

fig is ususally accessed in interactive mode (invoked with no arguments). However, if you wish to invoke it to process a single command, you can just put the command as a single argument on the command-line. Thus,

  fig 'add_peg_link fig|188.1.peg.96 <a href=http://nature.berkeley.edu/~hongwang/Project/ATP_synthase/>animation</a>'
would add a link to PEG fig|188.1.peg.96 while
  fig 'delete_all_peg_links fig|188.1.peg.96'
would delete all links from the PEG. Since adding links to PEGs is something you might wish to do, the easiest way to do it is to just create a file of commands of the sort shown, make the file executable, and then run it.

If you are a Perl programmer, eventually you will wish to peruse the fig source code to see how it is structured, how to add to it, and how to invoke services via the API.

We have briefly discussed the utility fig. Now it is time to discuss writing your own programs using the API. Basically, the entire API is encapsulated in the Perl module FIG.pm. To illustrate the style of programming we recommend, let's create a simple program to add a link to a PEG (rather than using fig). Adding a link to a PEG will cause the link to be displayed whenever anyone examines the PEG via the SEED web services.. The API defines a set of services that do many things; adding a link to a PEG is just one very minor service, but understanding exactly how one could invoke such a service is what this tutorial is about. To begin, use

        tool_hdr add_link
This creates a file called add_link and makes it executable. The file contains an initial set of Perl commands designed to establish the correct execution environment (where the libraries are, and so forth). You need to add your code to the end of the initialized file. We recommend adding the following lines:
use FIG;
my $fig = new FIG;

$usage = "usage: add_link PEG Link";

(
 ($peg  = shift @ARGV) &&
 ($link = shift @ARGV)
)
    || die $usage;

$fig->add_peg_links([$peg,$link]);

A short program to add a link to a PEG


We urge you to study this short program carefully, since it reflects the style we will use throughout this tutorial. There are a number of things worth noting: We suggest that you type in this program and gget it to run on your version of the SEED. Then, use fig to delete any links you added.

Accessing Data Relating to Genomes

Let us now begin a somewhat more systematic exploration of the services offered by the API. We will begin with genome-related services, of which there are relatively few in the SEED. Essentially, genomes are represented by IDs of the form xxx.y where xxx is the NCBI taxon ID, and y is a suffix to disambiguate situations in which you have multiple genomes with the same taxon ID.
use strict;
use FIG;
my $fig = new FIG;

my(%sofar,$genome,$which,$abbrev,$version);

foreach $genome ($fig->genomes("complete"))
{
    $which   = $fig->genus_species($genome);
    $abbrev  = &get_unique_abbreviation($which,\%sofar);
    $version = $fig->genome_version($genome);
    print "$genome\t$abbrev\t$version\t$which\n";
}

sub get_unique_abbreviation {
    my($which,$sofar) = @_;
    my($nonunique,$n);

    $nonunique = &FIG::abbrev($which);
    $nonunique =~ s/ //g;
    $nonunique =~ s/\.+$//;
    $n = $sofar->{$nonunique};
    if (! defined($n))
    {
        $n = $sofar->{$nonunique} = 1;
    }
    $sofar->{$nonunique}++;
    return $nonunique . ".$n";
}

An illustration of genome-related services


There are a few points worth noting about the short example: Here are just a few lines from the output of this little program:
197221.1        The.elon.BP.1   197221.1.3406724106     Thermosynechococcus elongatus BP-1
198094.1        Bac.anth.st.2   198094.1.2593352402     Bacillus anthracis str. Ames
198214.1        Shi.flex.2a.1   198214.1.1464268764     Shigella flexneri 2a str. 301
198215.1        Shi.flex.2a.2   198215.1.1756034760     Shigella flexneri 2a str. 2457T
What other genome-related services does the SEED offer? The only ones worth mentioning here relate to contigs. Here is a short example illustrating how to use those services. Basically, you can
  1. get a list of the contigs in a given genome,
  2. get the length of a specified contig, and
  3. extract sequence from a contig.
To illustrate these features, here is a short program that list all of the contigs for a designated genome. For each contig that is over 50 characters in size (and, they all should be), it displays the contig ID, the contig length, and the first 50 characters of sequence.
use strict;
use FIG;
my $fig = new FIG;

my $usage = "usage: contigs_for_genome Genome";

my($genome,@contigs,$contig,$len,$seq);

($genome  = shift @ARGV) 
    || die $usage;

@contigs = $fig->all_contigs($genome);
foreach $contig (@contigs)
{
    $len = $fig->contig_ln($genome,$contig);
    if ($len >= 50)
    {
        $seq = $fig->dna_seq($genome,$contig . "_1_50");
        print "$contig\t$len\t$seq\n";
    }
}

How to access services related to contigs


The services being used in this short example are as follows: This last service (dna_seq) requires some explanation. You need to know that
  1. Locations on a contig begin with 1 (not 0 as some computer scientists would prefer).
  2. Within the SEED, locations are represented as a comma-separated list of contiguous regions. A contiguous region is represented by a contig ID, a beginning position, and an ending position all separated by underscores. One fact that can be a little confusing is that contig IDs themselves may contain underscores. Thus,
            NC_000913_1_50
    
    would be used to represent the first fifty chracters of the contig NC_000913.
  3. Sections of sequence on the complementary strand are represented similarly (with the beginning point greater than the ending position). Thus,
            NC_000913_50_1
    
    would represent the sequence on the complementary strand of NC_000913 going from position 50 to position 1.
  4. The routine dna_seq will accept a location (i.e., a comma-separated list of contiguous regions), or even just a list of arguments each representing a contiguous region, and return the concatenation of the sequences from those locations. Thus,
            $fig->dna_seq($genome,"NC_000913_1_50,NC_000913_52_100")
    or
            $fig->dna_seq($genome,"NC_000913_1_50","NC_000913_52_100")
    
    would return 99 characters formed by deleting character 51 from the first 100 characters of NC_000913.

Accessing Data Relating to Features

The SEED was designed, like WIT and ERGO before it, to allow a variety of types of features. In fact, the SEED is essentially focused on protein-encoding genes, which I perversely named pegs rather than CDSs. It is true that WIT contained RNAs, inteins, introns, and repeat sequences, and ERGO contained RNAs and IS elements; however, the SEED really only has pegs and RNAs at this point (and, the RNAs are not adequately handled).

There is a fairly rich set of services for pegs, as you might guess from the web services offered by the SEED. As an introduction to some of these capabilities, consider the following short program:


use strict;
my($usage,$id,$fid,$peg,$loc,$contig,$beg,$end,$start_reg,$end_reg);
my($loc1,$aliases1,$trunc,$pseq,$prot_ln,$func,$features_in_region);
my($start_of_leftmost_feature,$end_of_rightmost_feature);
my($genome);

use FIG;
my $fig = new FIG;

$usage = "usage: features_around ID";

($id = shift @ARGV)
    || die $usage;

if ($peg = $fig->by_alias($id))
{
    if ($loc = $fig->feature_location($peg))
    {
        ($contig,$beg,$end) = $fig->boundaries_of($loc);
        if (defined($contig))
        {
            $start_reg = &FIG::min($beg,$end) - 5000;
            $end_reg   = &FIG::max($beg,$end) + 5000;
            $genome = &FIG::genome_of($peg);
            ($features_in_region,$start_of_leftmost_feature,$end_of_rightmost_feature) =
                $fig->genes_in_region($genome,$contig,$start_reg,$end_reg);
            foreach $fid (@$features_in_region)
            {
                $loc1     = $fig->feature_location($fid);
                $aliases1 = $fig->feature_aliases($fid);
                $trunc    = $fig->possibly_truncated($fid);

                $pseq = $func = $prot_ln = "";

                if (&FIG::ftype($fid) eq "peg")
                {
                    if ($prot_ln = $fig->translation_length($fid))
                    {
                        $pseq = $fig->get_translation($fid);
                        $func = $fig->function_of($fid);
                    }
                }
                print join("\t",($fid,$loc1,$aliases1,$trunc,$func,$prot_ln,$pseq)),"\n";
            }
        }
    }
    else
    {
        print STDERR "Sorry, could not get the location of $fid\n";
    }
}
else
{
    print STDERR "Sorry, could not figure out which PEG you meant by $id\n";
}

Displaying features close to a given feature


This short program takes as an argument an ID that corresponds to a feature. The ID can be a FIG feature ID of the form
        fig|xxxx.y.peg.zzzz
where xxxx.y is the genome ID and zzzz is a unique sequence number (these usually occur in the same order as the genes occur on the contigs, but this is not guaranteed to be true). The argument could also be any other ID known by the SEED that uniquely corresponds to the FIG ID. For example,
        features_around 'sp|Q9TL34'
would display twelve features (fig|31312.1.peg.1 through fig|31312.1.peg.12, since sp|Q9TL34 corresponds to fig|31312.1.peg.5. The program outputs a tab-separated file of the sort one might use as input to a spreadsheet like Excel. The columns in the output contain:
  1. the FIG ID,
  2. the location of the feature,
  3. a list of aliases for the gene,
  4. a flag indicating whether or not the gene occurs near the end of the contig (and, hence, is potentially truncated),
  5. the function currently assigned to the gene,
  6. the length of the protein sequence encoded by the gene, and
  7. the protein sequence encoded by the gene.
Of course, some of the features might not be PEGs, and in that case the last three fields are just empty.

Now, let's go through some of the details in the code:

The SEED comes with a directory containing the example programs we are displaying. If you find it hard to follow my comments or feel that I have failed left out some detail that you need to understand, please
  1. contact me (Ross@TheFIG.info), and
  2. copy the program and play with it. The most basic form of playing is to just do things like inserting
            print &Dumper($some_returned_variable);
    
    to see precisely what comes back.

Sequence Data

You have already been introduced to sequence data. You know how to access the DNA from a location in a contig from a designated genome, you know you to get locations of features, and you know how to acquire the translations for PEGs. In this short section we discuss just a few utilities that you might find useful.

The next short example illustrates how to access the reverse complement of a DNA sequence (i.e., the opposite strand), how to translate DNA sequences, and how to write out sequences in FASTA format. Let us look at it before thinking about additional complexities like how to handle nonstandard translations or properly translating the first codon:


use FIG;
my $fig = new FIG;

$usage = "usage: seq_utils Genome Location";

(
 ($genome = shift @ARGV) &&
 ($loc  = shift @ARGV) 
)
    || die $usage;

if ($seq = $fig->dna_seq($genome,$loc))
{
    $seqR = &FIG::reverse_comp($seq);

    $tran  = &FIG::translate($seq);
    $tranR = &FIG::translate($seqR);

    &FIG::display_id_and_seq("plus_strand",\$seq);
    &FIG::display_id_and_seq("plus_strand_translated",\$tran,\*STDERR);

    &FIG::display_id_and_seq("minus_strand",\$seqR);
    &FIG::display_id_and_seq("minus_strand_translated",\$tranR,\*STDERR);
}

Using some simple sequence utilities


The program takes two command-line arguments: a genome ID and a location. The program is supposed to write out the DNA from both the plus and minus strands from the given location to STDOUT, and it writes the translations of both strands to STDERR. Thus,
        seq_utils 31312.1 NC_000927_9431_9048 > stdout 2> stderr
will write the following contents to the file stdout:
>plus_strand
ATGTCTACTATTACAAATCAAATTATTGAACAACTAAAATCGATGACGCTTTTGGAAGCG
GCTGAACTTGTGTCTCAAATTGAAGAGACATTTGGAGTTGATGCTTCTGCACCCGCAGCT
GGAGCGGTGATGGTCGCTGCCGCTGCGCCGTCTGCGCCAGTTGTCGAAGAGCAAACTGAA
TTTACTGTCATGATTAATGATGTGCCAAGTGCGAAACGAATTGCTGTTATTAAAGTTGTA
CGTGCTTTAACAGGATTAGGGTTGAAAGAAGCCAAAGATATGATTGAGTCTGTACCAAAA
GCAATTAAAGAAGGAATTGGGAAAGACGAAGCTCAACAAATTAAACAACAGCTCGAAGAG
GCTGGTGCAACTGTCACAGTGAAA
>minus_strand
TTTCACTGTGACAGTTGCACCAGCCTCTTCGAGCTGTTGTTTAATTTGTTGAGCTTCGTC
TTTCCCAATTCCTTCTTTAATTGCTTTTGGTACAGACTCAATCATATCTTTGGCTTCTTT
CAACCCTAATCCTGTTAAAGCACGTACAACTTTAATAACAGCAATTCGTTTCGCACTTGG
CACATCATTAATCATGACAGTAAATTCAGTTTGCTCTTCGACAACTGGCGCAGACGGCGC
AGCGGCAGCGACCATCACCGCTCCAGCTGCGGGTGCAGAAGCATCAACTCCAAATGTCTC
TTCAATTTGAGACACAAGTTCAGCCGCTTCCAAAAGCGTCATCGATTTTAGTTGTTCAAT
AATTTGATTTGTAATAGTAGACAT
and the following two sequences to the file stderr
>plus_strand_translated
MSTITNQIIEQLKSMTLLEAAELVSQIEETFGVDASAPAAGAVMVAAAAPSAPVVEEQTE
FTVMINDVPSAKRIAVIKVVRALTGLGLKEAKDMIESVPKAIKEGIGKDEAQQIKQQLEE
AGATVTVK
>minus_strand_translated
FHCDSCTSLFELLFNLLSFVFPNSFFNCFWYRLNHIFGFFQP*SC*STYNFNNSNSFRTW
HIINHDSKFSLLFDNWRRRRSGSDHHRSSCGCRSINSKCLFNLRHKFSRFQKRHRF*LFN
NLICNSRH
Now let us look at the routines within the program that produced these results: