[Bio] / FigKernelPackages / SAPtutorial.pm Repository:
ViewVC logotype

View of /FigKernelPackages/SAPtutorial.pm

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1.3 - (download) (as text) (annotate)
Tue Aug 4 21:31:29 2009 UTC (10 years, 7 months ago) by parrello
Branch: MAIN
Changes since 1.2: +35 -0 lines
Added another example.

#!/usr/bin/perl -w
use strict;

=head1 Using the Sapling Server -- A Tutorial

=head2 What Is the Sapling Server?

The B<Sapling Server> is a web service that allows you to access data stored in
the Sapling database, a large-scale MySQL database of genetic data. The Sapling
database contains data on public genomes imported from L<RAST|http://rast.nmpdr.org>
and curated by the annotation team of the Fellowship for Interpretation of
Genomes.

The L<SAPserver> package wraps calls to the Sapling Server so that you can use
them in your PERL program.

All Sapling Server services are documented in the L<SAP> module. Each method has
a signature like

    my $document = $sapObject->taxonomy_of($args);

where C<$document> is usually a hash reference and C<$args> is B<always> a hash
reference. The method description includes a section called I<Parameter Hash
Fields> that describes the fields in C<$args>. For example, L<SAP/taxonomy_of>
has a field called C<ids> that is to be a list of genome IDs and an optional
field called C<format> that indicates whether you want taxonomy groups
represented by numbers, names, or both. To call the I<taxonomy_of> service,
you create a B<SAPserver> object and call a method with the same name as the
service.

    use SAPserver;

    my $sapServer = SAPserver->new();
    my $resultHash = $sapServer->taxonomy_of({ ids => ['360108.3', '100226.1'],
                                               format => 'names' });
    for my $id (keys %$resultHash) {
        my $taxonomy = $resultHash->{$id};
        print $id, join(" ", @$taxonomy);
    }

For convenience, you can specify the parameters as a simple hash rather
than a hash reference. So, for example, the above I<taxonomy_of> call could
also be written like this.

    my $resultHash = $sapServer->taxonomy_of(ids => ['360108.3', '100226.1'],
                                             format => 'names');

=head2 A Simple Example: Genome Taxonomies

Let's look at a simple program that pulls all the complete genomes from the
database and displays their representative genomes plus their texonomies in name
format.

Three Sapling Server methods are needed to perform this function.

=over 4

=item *

L<SAP/all_genomes> to get the list of genome IDs.

=item *

L<SAP/taxonomy_of> to get the genome taxonomies.

=item *

L<SAP/representative> to get the representative genome IDs.

=back

The program starts by connecting to the Sapling Server itself.

    use strict;
    use SAPserver;
    
    my $sapServer = SAPserver->new();

Now we use I<all_genomes> to get a list of the IDs for complete genomes.
I<all_genomes> will normally return B<all> genome IDs, but we use the
C<complete> option to restrict the output to complete genomes.

    my $genomeIDs = $sapServer->all_genomes(complete => 1);

All we want are the genome IDs, so we use a PERL trick to convert the
hash reference to a list reference.

    $genomeIDs = [ keys %$genomeIDs ];

Now we ask for the representatives and the taxonomies.

    my $representativeHash = $sapServer->representative(ids => $genomeIDs);
    my $taxonomyHash = $sapServer->taxonomy_of(ids => $genomeIDs,
                                               format => 'names');

Our data is now contained in a pair of hash tables. The following loop
stiches them together to produce the output.

    for my $genomeID (@$genomeIDs) {
        my $repID = $representativeHash->{$genomeID};
        my $taxonomy = $taxonomyHash->{$genomeID};
        # Format the taxonomy string.
        my $taxonomyString = join(" ", @$taxonomy);
        # Print the result.
        print join("\t", $genomeID, $repID, $taxonomyString) . "\n";
    }

The Sapling Server processes lists of data (in this case a list of genome IDs)
so that you can minimize the overhead of sending requests across the web. You
can, however, specify a single ID instead of a list to a method call, and
this would allow you to structure your program with a more traditional loop, as
follows. To make this process simpler, you construct the Sapling Server object
in I<singleton mode>. In singleton mode, when you pass in only a single ID,
you get back an actual result instead of a hash reference.

    use strict;
    use SAPserver;
    
    my $sapServer = SAPserver->new(singleton => 1);
    my $genomeIDs = $sapServer->all_genomes(complete => 1);
    $genomeIDs = [ keys %$genomeIDs ];
    for my $genomeID (@$genomeIDs) {
        my $repID = $sapServer->representative(ids => $genomeID);
        my $taxonomy = $sapServer->taxonomy_of(ids => $genomeID,
                                               format => 'names');
        # Format the taxonomy string.
        my $taxonomyString = join(" ", @$taxonomy);
        # Print the result.
        print join("\t", $genomeID, $repID, $taxonomyString) . "\n";
    }

At this point there is a risk that you are bewildered by all the options we've
presented-- hashes, hash references, singleton mode-- however, the goal here is
to provide a facility that will fit comfortably with different programming
styles. The server software tries to figure out how you want to use it and
adjusts accordingly.

=head2 Specifying Gene IDs

Many of the Sapling Server services return data on genes (a term we use rather
loosely to include any kind of genetic I<locus> or C<feature>). The standard
method for identifying a gene is the I<FIG ID>, an identifying string that
begins with the characters C<fig|> and includes the genome ID, the gene type,
and an additional number for uniqueness. For example, the FIG ID
C<fig|272558.1.peg.203> describes a protein encoding gene (I<peg>) in
Bacillus halodurans C-125 (I<272558.1>).

Frequently, however, you will have a list of gene IDs from some other
database (e.g. L<NCBI|http://www.ncbi.nlm.nih.gov>, L<UniProt|http://www.uniprot.org>)
or in a community format such as Locus Tags or gene names. Most services that
take gene IDs as input allow you to specify a C<source> option that indicates
the type of IDs being used. The acceptable formats are as follows.

=over 4

=item CMR

L<Comprehensive Microbial Resource ID|http://cmr.jcvi.org>. The CMR IDs for
C<fig|272558.1.peg.203> are C<10172815> and C<NTL01BH0204>.

=item GENE

Common Gene name. Often, these correspond to a large number of specified genes.
For example, C<accD>, which identifies the Acetyl-coenzyme A carboxyl
transferase beta chain, corresponds to 58 specific genes in the database.

=item GeneID

Common gene number. The common gene number for C<fig|272558.1.peg.203> is
C<891745>.

=item KEGG

L<Kyoto Encyclopedia of Genes and Genomes|http://www.genome.jp/kegg> identifier.
For example, the KEGG identifier for C<fig|158878.1.peg.2821> is C<sav:SAV2628>.

=item LocusTag

Common locus tag. For example, the locus tag of C<fig|380703.5.peg.250> is
C<ABK38410.1>.

=item NCBI

L<NCBI|http://www.ncbi.nlm.nih.gov> accession number. The accession numbers for
C<fig|272558.1.peg.203> are C<10172815>, C<15612766>, and C<81788207>.

=item RefSeq

L<NCBI|http://www.ncbi.nlm.nih.gov> reference sequence identifier. The RefSeq
identifier for C<fig|272558.1.peg.203> is C<NP_241069.1>.

=item SEED

FIG identifier. This is the default option.

=item SwissProt

L<SwissProt|http://www.expasy.ch/sprot> identifier. For example, the SwissProt
identifier for C<fig|243277.1.peg.3952> is C<O31153>.

=item UniProt

L<UniProt|http://www.uniprot.org> identifier. The UniProt identifiers for
C<fig|272558.1.peg.203> are C<Q9KGA9> and C<Q9KGA9_BACHD>.

=back

You can also mix identifiers of different types by specifying C<mixed>
for the source type. In this case, however, care must be taken, because the same
identifier can have different meanings in different databases.

=head2 Normal Mode Examples

The following examples use the Sapling Server in normal mode: that is, data
is sent to the server in batches and the results are stitched together
afterward. In this mode there is significantly reduced overhead, but there is
also a risk that the server request might time out. If this happens, you may
want to consider breaking the input into smaller batches. At some point, the
server system will perform sophisticated flow control to reduce the risk of
timeout errors, but we are not yet at that point.

=head3 Retrieving Functional Roles

There are two primary methods for retrieving functional roles.

=over 4

=item *

L<SAP/ids_to_functions> returns the current functional assignment of a gene.

=item *

L<SAP/ids_to_subsystems> returns the subsystems and roles of a gene.

=back

In both cases, the list of incoming gene IDs is given as a list via the C<ids>
parameter. It is assumed the IDs are FIG identifiers, but the C<source> parameter
can be used to specify a different ID type (see L</Specifying Gene IDs>).

B<ids_to_functions> provides the basic functional role. Almost every gene in the
system will return a result with this method. The following example program
reads a file of UniProt IDs and produces their functional roles.  Note that
we're assuming the file is a manageable size: since we're submitting the entire
file at once, we risk a timeout error if it's too big.

    use strict;
    use SAPserver;
    
    my $sapServer = SAPserver->new();
    # Read all the gene IDs.
    my @genes = map { chomp; $_ } <STDIN>;
    # Compute the functional roles.
    my $results = $sapServer->ids_to_functions(ids => \@genes, source => 'UniProt');
    # Loop through the genes.
    for my $gene (@genes) {
        # Did we find a result?
        my $role = $results->{$gene};
        if (defined $role) {
            # Yes, print it.
            print "$gene\t$role\n";
        } else {
            # No, emit a warning.
            print STDERR "$gene was not found.\n";
        }
    }

B<ids_to_subsystems> returns roles in subsystems. Roles in subsystems have 
several differences from general functional roles. A single gene may be in
multiple subsystems and may have multiple roles in a subsystem. In addition,
only half of the genes in the database are currently associated with subsystems.
As a result, instead of a single string per incoming gene, B<ids_to_subsystems>
returns a list. Each element of the list consists of the role name followed by
the subsystem name. This makes the processing of the results a little more
complex, because we have to iterate through the list. An empty list indicates
the gene is not in a subsystem (although it could also indicate the gene was
not found).

    use SAPserver;
    
    my $sapServer = SAPserver->new();
    # Read all the gene IDs.
    my @genes = map { chomp; $_ } <STDIN>;
    # Compute the functional roles.
    my $results = $sapServer->ids_to_subsystems(ids => \@genes, source => 'UniProt');
    # Loop through the genes.
    for my $gene (@genes) {
        # Did we find a result?
        my $roleData = $results->{$gene};
        if (! @$roleData) {
            # Not in a subsystem: emit a warning.
            print STDERR "$gene is not in a subsystem.\n";
        } else {
            # Yes, print the entries.
            for my $ssData (@$roleData) {
                print "$gene\t$ssData->[0]\t($ssData->[1])\n"
            }
        }
    }

=head3 Genes in Subsystems for Genomes

This next example finds all genes in subsystems for a specific genome. We will
perform this operation in two phases. First, we will find the subsystems for
each genome, and then the genes for those subsystems. This requires two Sapling
Server functions.

=over 4

=item *

L<SAP/genomes_to_subsystems> returns a list of the subsystems for each
specified genome.

=item *

L<SAP/ids_in_subsystems> returns a list of the genes in each listed
subsystem for a specified genome.

=back

Our example program will loop through the genome IDs in an input file. For
each genome, it will call I<genomes_to_subsystems> to get the subsystem list,
and then feed the list to I<ids_in_subsystems> to get the result.

L<SAP/ids_in_subsystems> returns gene IDs rather than taking them as input.
Like L<SAP/ids_to_subsystems> and L<SAP/ids_to_functions>, it takes a C<source>
parameter that indicates the type of ID desired (e.g. C<NCBI>, C<CMR>, C<LocusTag>).
In this case, however, the type describes how the gene IDs will be formatted in
the output rather than the type expected upon input. If a gene does not have an
ID for a particular source type (e.g. C<fig|100226.1.peg.3361> does not have a
locus tag), then the FIG identifier is used. The default source type (C<SEED>)
means that FIG identifiers will be used for everything.

The program is given below.

    use strict;
    use SAPserver;
    
    my $sapServer = SAPserver->new();
    # Loop through the input file. Note that this loop will stop on the first
    # blank line.
    while (my $genomeID = <STDIN>) {
        chomp $genomeID;
        # Get the subsystems for this genome.
        my $subHash = $sapServer->genomes_to_subsystems(ids => $genomeID);
        # The data returned for each genome (and in our case there's only one)
        # includes the subsystem name and the variant code. The following
        # statement strips away the variant codes, leaving only the subsystem
        # names.
        my $subList = [ map { $_->[0] } @{$subHash->{$genomeID}} ];
        # Ask for the genes in each subsystem, using NCBI identifiers.
        my $roleHashes = $sapServer->ids_in_subsystems(subsystems => $subList,
                                                     genome => $genomeID,
                                                     source => 'NCBI',
                                                     roleForm => 'full');
        # The hash maps each subsystem ID to a hash of roles to lists of feature
        # IDs. We therefore use three levels of nested loops to produce our
        # output lines. At the top level we have a hash mapping subsystem IDs
        # to role hashes.
        for my $subsystem (sort keys %$roleHashes) {
            my $geneHash = $roleHashes->{$subsystem};
            # The gene hash maps each role to a list of gene IDs.
            for my $role (sort keys %$geneHash) {
                my $geneList = $geneHash->{$role};
                # Finally, we loop through the gene IDs.
                for my $gene (@$geneList) {
                    print "$genomeID\t$subsystem\t$role\t$gene\n";
                }
            }
        }
    }

The I<ids_in_subsystems> service has several useful options for changing the nature
of the output. For example, in the above program each role is represented by a
full description (C<roleForm> set to C<full>). If you don't need the roles, you
can specify C<none> for the role form. You can also request that the gene IDs
be returned in a comma-separated list instead of a list data structure. These
two changes can drastically simplify the above program.

    use strict;
    use SAPserver;
    
    my $sapServer = SAPserver->new();
    # Loop through the input file. Note that this loop will stop on the first
    # blank line.
    while (my $genomeID = <STDIN>) {
        chomp $genomeID;
        # Get the subsystems for this genome.
        my $subHash = $sapServer->genomes_to_subsystems(ids => $genomeID);
        # The data returned for each genome (and in our case there's only one)
        # includes the subsystem name and the variant code. The following
        # statement strips away the variant codes, leaving only the subsystem
        # names.
        my $subList = [ map { $_->[0] } @{$subHash->{$genomeID}} ];
        # Ask for the genes in each subsystem, using NCBI identifiers.
        my $genesHash = $sapServer->ids_in_subsystems(subsystems => $subList,
                                                     genome => $genomeID,
                                                     source => 'NCBI',
                                                     roleForm => 'none',
                                                     grouped => 1);
        # The hash maps each subsystem ID to a comma-delimited list of gene IDs.
        for my $subsystem (sort keys %$genesHash) {
            my $genes = $genesHash->{$subsystem};
            print "$genomeID\t$subsystem\t$genes\n";
        }
    }

=cut

1;

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3