[Bio] / FigWebPages / servers.html Repository:
ViewVC logotype

View of /FigWebPages/servers.html

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1.1 - (download) (as text) (annotate)
Thu Nov 19 16:49:52 2009 UTC (9 years, 11 months ago) by disz
Branch: MAIN
CVS Tags: rast_rel_2014_0912, rast_rel_2010_0928, rast_rel_2010_0827, rast_rel_2014_0729, myrast_33, rast_rel_2011_0928, rast_rel_2010_0526, rast_rel_2010_1206, rast_rel_2010_0118, rast_rel_2011_0119, HEAD
new html for the servers page

<title>Web-Based Annotation of Protein Sequences</title>

<center><H1><A NAME="xtocid285920">Web-Based Annotation of Protein Sequences</A></H1></center>
<br><br><br>

<H1>Table of Contents</H1>

<UL><LI><A HREF="#xtocid285921">A. Introduction</A><LI><A HREF="#xtocid285922">B. Summary of the Individual Servers</A>
<UL><LI><A HREF="#xtocid285923">B.1 The Annotation Clearinghouse Server (ACH)</A><LI><A HREF="#xtocid285924">B.2 The FIGfam Server</A><LI><A HREF="#xtocid285925">B.3 The Subsystem Server</A><LI><A HREF="#xtocid285926">B.4 The Genomics ER Model Server</A><LI><A HREF="#xtocid285927">B.5  The Functional Coupling Server</A>
</UL><LI><A HREF="#xtocid285928">C. Distribution of the SEED server packages</A><LI><A HREF="#xtocid285929">D. Demonstration of the SEED servers</A>
<UL><LI><A HREF="#xtocid2859210">D.1 Example 1: Conversion of Gene and Protein IDs</A><LI><A HREF="#xtocid2859211">D.2 Example 2: Metabolic Reconstructions Provided for Complete Prokaryotic Genomes</A><LI><A HREF="#xtocid2859212">D.3 Example 3: Creating Custom Interfaces</A><LI><A HREF="#xtocid2859213">D.4 Example 4: Access to Functional Coupling (Conserved Contiguity) Data</A><LI><A HREF="#xtocid2859214">D.5 Services to Support Annotation of Genes</A>
</UL><LI><A HREF="#xtocid2859215">E. A collection of basic functions using the server</A>
<UL><LI><A HREF="#xtocid2859216">E.1 Find all features for a genome.</A><LI><A HREF="#xtocid2859217">E.2 Find Gene Function</A><LI><A HREF="#xtocid2859218">E.3 Find Gene Aliases</A><LI><A HREF="#xtocid2859219">E.4 Find Neighbors</A></UL>
</UL>
</UL>

<hr>
<h2><A NAME="xtocid285921">A. Introduction</A></h2>
Over the last year, we have been building a set of remotely accessible programmatic interfaces to
<a href="http://www.theseed.org">The SEED</a>. Here, we describe a set of web-based servers that implement the key components needed to access the currently encoded subsystems, or seek to infer clues of function based on conserved contiguity or even build your own annotation system.  These servers can all be easily accessed and are freely available to all users.  They form the core of a technology that will, we believe, be critical to the high-throughput annotation of prokaryotic genomes.


<h2><A NAME="xtocid285922">B. Summary of the Individual Servers</A></h2>

<h3><A NAME="xtocid285923">B.1 The Annotation Clearinghouse Server (ACH)</A></h3>

The most basic category of services relates to the resolution of IDs describing genes and the proteins encoded by a subset of those genes.  A system that attempts to benefit from the annotation efforts of numerous groups worldwide must directly confront the fact that numerous identifiers are employed to describe genes and proteins.  In some cases, it is very difficult to accurately establish correspondences between the genes represented by distinct IDs.  The situation with proteins is somewhat different in that one can consider the sequence of a protein as an identifier to which most other identifiers can easily be mapped (since most protein IDs can easily be associated with the sequence of the protein they identify).  From this observation it is just a short step to mapping protein IDs to MD5 hashes, and treating the MD5 hash values as unique identifiers of the protein sequences.

<p>
The ACH supports two queries, <i>equiv_precise</i> and <i>equiv_sequence</i>.   Each takes as input a set of protein IDs.  Each service returns a table corresponding to each input ID.  The table contains alternative IDs, annotations (i.e., product names) supplied by other groups to these IDs, and an indication of whether the sources of these product names considered them to be <i>expert annotations</i>.  Here, "expert annotation" is simply taken to mean the individual making the annotation felt that it was relatively reliable (and that they were qualified to make that judgment).
<p>
The difference between the two queries relates to what we mean by "alternative IDs".

<ol>
<li><b>equiv_precise</b> - Each of the alternative IDs refers to the protein encoded by the same gene from the same genome.  We call such IDs precisely equivalent.</li>
<p>
<li><b>equiv_sequence</b> - Each of the alternative IDs refers to a protein with the same sequence.  We call such IDs sequence equivalent.</li>

</ol>
<p>
Thus, one service provided by the ACH is to return all annotations for precisely equivalent IDs, and the other returns annotations for sequence equivalent IDs.

These services are essential to allowing access to annotations from different groups, for inter-converting IDs, and for locating which annotations are considered especially reliable.  See <a href="API/ACH.html" target="_blank">the ACH API</a> for a complete description of the API.


<h3><A NAME="xtocid285924">B.2 The FIGfam Server</A></h3>
The most common FIGfam server queries are used to determine the relationship between specific protein sequences and families. For instance, you can make a request to see if a reliable annotation can be made to a sequence using FIGfams. Or, you can ask a set of specific questions of the form <i>"Should this protein be placed in that family?"</i> That is, in this case the server is not being asked to locate an appropriate family, but just to determine whether or not a specific sequence can safely be placed in a specific family.

<p>Other types of queries are used to extract data on the FIGfams themselves.   You can ask for the entire list of FIGfam IDs, along with the family function associated with each ID or you can ask for all of the member sequences for one or more input FIGfam IDs.

<p>These queries offer access to the entire FIGfam collection, and an easy way to classify new protein sequences using the collection. The API description is <a href="API/FFserver.html" target="_blank"> here</a>.



<h3><A NAME="xtocid285925">B.3 The Subsystem Server</A></h3>
The subsystems server supports queries against the database of subsystems. The most common type of query (<i>is_in_subsystem</i>) just requests, for one or more gene IDs, the set of subsystems associated with each ID.   The server will take each input ID, locate the set of sequence-equivalent FIG-IDs (i.e., IDs used within the NMPDR to reference protein-encoding genes), and for each set of sequence-equivalent genes the set of subsystems that contain one or more of the genes.  More precisely, for each of the FIG-IDs that is part of a subsystem, the server will return the function associated with the FIG-ID (which may be one or more functional roles - multiple in the case of multi-functional proteins), and the name of the subsystem.

<p>
Another common service (<i>is_in_subsystem_with</i>)can be used to extract all of the FIG-IDs that together make up an instance of a subsystem in a specific genome.  The input is one or more FIG-IDs, and for each ID the entire set of FIG-IDs that together with the given ID implement a specific subsystem are returned.

<p>
Other services are used to extract data about the subsystems collection.  One, (<i>all_subsystems</i>) allows the user to extract the names of all of the subsystems in the collection, and another, (<i>subsystem_spreadsheet</i>),  allows the user to extract all of the details for each of a designated set of subsystems (which genomes are in each subsystem, which variant of the subsystem is implemented, and which FIG-IDs implement the variant for each genome).


<p>
The entire API is described <a href="API/SS.html" target="_blank"> here</a>.
A tutorial describing the use of the API is available at [URL].

<h3><A NAME="xtocid285926">B.4 The Genomics ER Model Server </A></h3>
The Genomics ER Model Server offers comprehensive access to all of the genomic data in the SEED.  The data in the SEED is described precisely by an entity-relationship diagram encoded as metadata in XML.  To see the overall ER diagram and the relations that implement it see <a href="http://www.nmpdr.org/FIG/wiki/view.cgi/FIG/SaplingDatabase#IncludesIdentifier">the Sapling webpage</a>.

<p>We offer a programmatic interface that can easily be used to extract arbitrary data quite easily from the server.  The API is described <a href="API/SAP.html" target="_blank"> here</a>. A complete tutorial is offered in <a href="TUTORIAL/SAPtutorial.html" target="_blank"> SAP tutorial</a>.

<p>
<h3><A NAME="xtocid285927">B.5  The Functional Coupling Server </A></h3>
The API is  <a href="API/CO.html" target="_blank"> here</a>.
<hr>

<h2><A NAME="xtocid285928">C. Distribution of the SEED server packages </A></h2>

The SEED servers are located at Argonne National Laboratory and are accessed with a set of perl packages that we distribute. 

Download the server distribution <a href="downloads/sas.tgz"/>here.</a>
<p>

Untar the tarball into a directory of your choice, put its bin directory in your path and you should be ready to go.

<hr>

<h2><A NAME="xtocid285929">D. Demonstration of the SEED servers</A></h2>
<h3><A NAME="xtocid2859210">	D.1 Example 1: Conversion of Gene and Protein IDs</A></h3>

In <a href="server_papers/server_paper_example1.pl.html" target="_blank">example1</a> we illustrate some basic capabilities that relate to determining the set of IDs attached to specific protein sequences.  The program accepts a protein ID as input.  The ID may be one of several that are maintained by the SEED, UniProt, RefSeq, KEGG and other groups.  The program first accesses all IDs attached to identical protein sequences.  This can be a fairly large set in cases in which many very similar genomes have been sequenced.  
<p>
The program determines the set of IDs associated with proteins that have identical sequence to the input protein ID.  We call these sequence-equivalent proteins.  The program displays the associated protein sequence, and then it computes a table describing these proteins.  There will be at least one row for each of the computed IDs.   There will be several rows if multiple groups have used the same ID and attached functional assignments to the ID.  The columns in the table are as follows:
<p>
<ol>
<li>	The first column is the ID of the protein.</li>
<li>	The second is the description of the genome (usually genus, species and strain) associated with the ID.</li>
<li>	The third column indicates whether we believe that the ID corresponds to the product of the same gene from the same genome associated with the input ID (i.e., "Is this the product of the same gene from the same genome, possibly from different releases?").</li>
<li>	The fourth column gives the functional assignment associated with the protein ID. </li>
<li>	The fifth column gives the source of the assignment.</li>
<li>	The last column indicates whether or not the source of the assignment is an expert (i.e., whether or not this is an "expert assignment?").</li>
</ol> 
<p>
The need to map IDs between groups and compare asserted functions of proteins is quite basic.  It would be straightforward for any group to write a short CGI script using the capabilities illustrated in example1 that supported connecting protein IDs to literature (via PubMed), to structure data (when present), and so forth.  Every annotation team needs this class of functionality.

<p> 
<h4>Example 1 Discussion</h4>
With the server packages installed as described in section C, the code in <a href="server_papers/server_paper_example1.pl.html" target="_blank">example1</a> can be run as follows 

<pre>
> perl server_paper_example1.pl "fig|83333.1.peg.145"
</pre>


The work of this routine is in two parts. In lines 43-49, we use the ACH server to build a hash of identifiers for precisely equivalent genes used in determining the contents of column 3 later on.

<pre>
42.	    my %preciseHash;
43.	    my $precise_assertions_list = $achObject->equiv_precise(-ids => $id);
44.	    if ($precise_assertions_list->[0]) {
45.	        my ($inputID, $precise_assertions) = @{$precise_assertions_list->[0]};
46.	        for my $precise_assertion (@$precise_assertions) {
47.	            my ($newID, $function, $source, $expert) = @$precise_assertion;
48.	            $preciseHash{$newID} = 1;
49.	        }
</pre>



In lines 52-63, we again use the ACH server to get all sequence equivalent IDs and produce the output table shown below (truncated for this example).
<pre>
52.	    my $assertions = $achObject->equiv_sequence(-ids => $id);
53.	    if (! @$assertions) {
54.	        print STDERR "No results found.\n";
55.	    } else {
56.	        for my $assertion (@$assertions) {
57.	            my ($newID, $function, $source, $expert, $genomeName) = @$assertion;
58.	            $genomeName = '' if ! defined $genomeName;
59.	            my $column3 = ($preciseHash{$newID} ? 1 : 0);
60.	            print join("\t", $newID, $genomeName, $column3, $function, $source,
61.	                             $expert) . "\n";
62.	        }
63.	    }
</pre>

<h4>Output Table</h4>
<pre>
cmr|NT01SF0150		0	dnaK suppressor protein VC0596 [imported]	CMR	0
cmr|NT02EC0154		0	dnaK suppressor protein VC0596 [imported]	CMR	0
cmr|NT02SB0136		0	RNA polymerase-binding protein DksA	CMR	0
cmr|NT02SD0166		0	RNA polymerase-binding protein DksA	CMR	0
cmr|NT02SF0144		0	dnaK suppressor protein VC0596 [imported]	CMR	0
cmr|NT03EC0181		0	DksA homolog	CMR	0
cmr|NT04EC0178		0	dnaK suppressor protein VC0596 [imported]	CMR	0
cmr|NT04SF0143		0	RNA polymerase-binding protein DksA	CMR	0
cmr|NT04SS0162		0	RNA polymerase-binding protein DksA	CMR	0
cmr|NT10EC0149		0	RNA polymerase-binding protein DksA	CMR	0
.
.
.
</pre>

<h3><A NAME="xtocid2859211">D.2 Example 2: Metabolic Reconstructions Provided for Complete Prokaryotic Genomes</A></h3>

Given a set of functional roles one often wishes to understand what subsystems can be inferred from the set.  The second example, <a href="#example2" target="_blank">example2</a>, reads as input a set of functional roles and constructs a table of subsystems, along with their variation codes, that can be identified.  The data displayed in this simple example could form the start of a research project to gather the functional roles not connected to subsystems, to determine whether they were not connected because a small set of functional roles were not present in the input, and to seek candidates for such "missing functional roles".  The ability to easily map functional roles into subsystems will improve over time, as the SEED annotation effort improves its collection of encoded subsystems [PMID: 16214803].


<h4>Example 2 Discussion</h4>
With the server packages installed as described in section C, the code in <a href="server_papers/server_paper_example2.pl.html" target="_blank">example 2</a> can be run as follows 

<pre>
> perl server_paper_example2.pl < Buchnera_roles.tbl
</pre>

The work of this routine is in two parts. In lines 22-28, we use the Subsystem Server to construct the list of role-id pairs for use later on. 

<pre>
22.	    my $ssObject = SSserver->new();
23.	    my @pairs;
24.	    while () {
25.	        chomp;
26.	        my ($id, $role) = split(/\t/, $_);
27.	        push @pairs, [$role, $id];
28.	    }
</pre>

In lines 30-36, we call the Subsystem Server method <i>metabolic_reconstruction</i> and process the results to produce the output table.

<pre>
30.	    my $reconstruction = 
31.	        $ssObject->metabolic_reconstruction(-roles => \@pairs);
32.	    for my $record (@$reconstruction) {
33.	        my ($variantID, $role, $id) = @$record;
34.	        my ($subsysID, $code) = $variantID =~ /^(.+):(.+)$/;
35.	        print join("\t", $subsysID, $code, $role, $id) . "\n";
36.	    }
</pre>


<h4>Example 2 Input File (Truncated)</h4>
<pre>
fig|107806.1.peg.1      Anthranilate synthase, aminase component (EC 4.1.3.27) # TrpAa          82
fig|107806.1.peg.2      Anthranilate synthase, amidotransferase component (EC 4.1.3.27) # TrpAb Buchnera aphidicola str. APS (Acyrthosiphon pisum)      167
fig|107806.1.peg.3      Anthranilate synthase, amidotransferase component (EC 4.1.3.27) # TrpAb Buchnera aphidicola str. APS (Acyrthosiphon pisum)      167
fig|107806.1.peg.7      2-isopropylmalate synthase (EC 2.3.3.13)        Buchnera aphidicola str. APS (Acyrthosiphon pisum)      508
fig|107806.1.peg.8      3-isopropylmalate dehydrogenase (EC 1.1.1.85)   Buchnera aphidicola str. APS (Acyrthosiphon pisum)      353
fig|107806.1.peg.9      3-isopropylmalate dehydratase large subunit (EC 4.2.1.33)       Buchnera aphidicola str. APS (Acyrthosiphon pisum)      462
fig|107806.1.peg.10     3-isopropylmalate dehydratase small subunit (EC 4.2.1.33)               28
fig|107806.1.peg.11     tRNA uridine 5-carboxymethylaminomethyl modification enzyme GidA        Buchnera aphidicola str. APS (Acyrthosiphon pisum)      150
fig|107806.1.peg.12     ATP synthase A chain (EC 3.6.3.14)      Buchnera aphidicola str. APS (Acyrthosiphon pisum)      264
.
.
.
</pre>


<h4>Example 2 Output Table (Truncated) </h4>
<pre>
tRNA processing	1	Ribonuclease P protein component (EC 3.1.26.5)	fig|107806.1.peg.24
tRNA processing	1	tRNA(Ile)-lysidine synthetase	fig|107806.1.peg.111
tRNA processing	1	tRNA-specific adenosine-34 deaminase (EC 3.5.4.-)	fig|107806.1.peg.247
tRNA processing	1	tRNA delta(2)-isopentenylpyrophosphate transferase (EC 2.5.1.8)	fig|107806.1.peg.541
tRNA processing	1	tRNA-i(6)A37 methylthiotransferase	fig|107806.1.peg.421
tRNA processing	1	Ribonuclease T (EC 3.1.13.-)	fig|107806.1.peg.187
tRNA processing	1	tRNA pseudouridine synthase B (EC 4.2.1.70)	fig|107806.1.peg.361
tRNA processing	1	tRNA pseudouridine synthase A (EC 4.2.1.70)	fig|107806.1.peg.198
tRNA aminoacylation, Arg	1.00	Arginyl-tRNA synthetase (EC 6.1.1.19)	fig|107806.1.peg.239
Experimental-EFP	1	Translation elongation factor P	fig|107806.1.peg.29
mnm5U34 biosynthesis bacteria	1	tRNA uridine 5-carboxymethylaminomethyl modification enzyme GidA	fig|107806.1.peg.11
mnm5U34 biosynthesis bacteria	1	tRNA 5-methylaminomethyl-2-thiouridine synthase TusD	fig|107806.1.peg.507
mnm5U34 biosynthesis bacteria	1	tRNA 5-methylaminomethyl-2-thiouridine synthase TusA	fig|107806.1.peg.427
mnm5U34 biosynthesis bacteria	1	tRNA 5-methylaminomethyl-2-thiouridine synthase TusC	fig|107806.1.peg.506
mnm5U34 biosynthesis bacteria	1	GTPase and tRNA-U34 5-formylation enzyme TrmE	fig|107806.1.peg.26
mnm5U34 biosynthesis bacteria	1	tRNA (5-methylaminomethyl-2-thiouridylate)-methyltransferase (EC 2.1.1.61)	fig|107806.1.peg.253
mnm5U34 biosynthesis bacteria	1	Cysteine desulfurase (EC 2.8.1.7)	fig|107806.1.peg.568
mnm5U34 biosynthesis bacteria	1	tRNA 5-methylaminomethyl-2-thiouridine synthase TusB	fig|107806.1.peg.505
Ribosome LSU bacterial	1	LSU ribosomal protein L18p (L5e)	fig|107806.1.peg.483
Ribosome LSU bacterial	1	LSU ribosomal protein L11p (L12e)	fig|107806.1.peg.47
Ribosome LSU bacterial	1	LSU ribosomal protein L23p (L23Ae)	fig|107806.1.peg.497
.
.
.
</pre>


<h3><A NAME="xtocid2859212">D.3 Example 3: Creating Custom Interfaces</A></h3>

Suppose that you had substantial expertise in graphical interfaces, understood the power of comparative analysis, and wished to support the ability to graphically display the chromosomal regions around a set of genes (normally from distinct genomes).  The SEED offers one alternative for doing this (see Figure X), but suppose that you did not like forcing users to find appropriate SEED IDs and you thought that you could develop a superior display.
<p>
  <a href="server_papers/server_paper_example3.pl.html" target="_blank">Example 3</a> illustrates the functions required to determine the location of a SEED gene encoding a specific protein and to acquire the genes from a given region centered on that location.  If you were to create a program to accept arbitrary protein IDs, use the conversion capabilities demonstrated in example1,  and display the regions graphically around these genes, you would have the core of a useful tool.  If you shaded genes from the same subsystem (determined using the capabilities described in example2), you could enhance the supported functionality.  Of course, you could also compute which genes could be connected to literature or structures and encode that data as well.


<h4>Example 3 Discussion</h4>
With the server packages installed as described in section C, the code in <a href="server_papers/server_paper_example3.pl.html" target="_blank">example 3</a> can be run as follows

<pre>
> perl server_paper_example3.pl "fig|100226.1.peg.3361" 2000
</pre>

Lines 12 - 14 get the location for the subject ID using the Sapling Server

<pre>
12.	my $sapObject = SAPserver->new();
13.	my $geneLocH  = $sapObject->fid_locations(-ids => [$geneID]);
14.	my $geneLoc   = $geneLocH->{$geneID}->[0];. 
</pre>



Lines 15 - 30  normalize the direction and compute the desired region.

<pre>
15.	if ($geneLoc =~ /^(\S+)_(\d+)([+-])(\d+)/)  # retrieve an encoded location
16.	{
17.	    my($contig,$beg,$strand,$length) = ($1,$2,$3,$4);
18.	    my ($left,$right);
19.	    if ($strand eq "+")
20.	    {
21.		($left,$right) = ($beg, $beg + ($length-1));
22.	    }
23.	    else
24.	    {
25.		($left,$right) = ($beg, $beg - ($length-1));
26.	    }
27.	    my $paddedLeft = ($left > $max_distance) ? $left  - $max_distance : 1;
28.	    my $paddedRight = $right + $max_distance;
29.	    my $sz          =  paddedRight + 1) - $paddedLeft;
30.	    my $region      = $contig . "_" . $paddedLeft . "+" . $sz;
</pre>

Lines 31 - 36 use the Sapling Server to retrieve all genes in the region and display the results.

<pre>
31.	    my $genesInRegionH = $sapObject->genes_in_region(-locations => [$region]);
32.	    my $genesInRegion   = $genesInRegionH->{$region};
33.	    foreach my $geneID2 (@$genesInRegion)
34.	    {
35.		print "$geneID2\n";
36.	    }
</pre>

<h4> Example 3 Output Table </h4>

<pre>
fig|100226.1.peg.3358	100226.1:NC_003888	3764143	+	867
fig|100226.1.peg.3359	100226.1:NC_003888	3765006	+	504
fig|100226.1.peg.3360	100226.1:NC_003888	3765814	+	360
fig|100226.1.peg.3361	100226.1:NC_003888	3766170	+	612
fig|100226.1.peg.3362	100226.1:NC_003888	3766852	+	489
fig|100226.1.peg.3363	100226.1:NC_003888	3767961	-	606
fig|100226.1.peg.3364	100226.1:NC_003888	3770099	-	2007
</pre>

<h3><A NAME="xtocid2859213">D.4 Example 4: Access to Functional Coupling (Conserved Contiguity) Data </A></h3>

A great deal has been learned from studying genes that tend to occur close to one another in diverse genomes [PMID: 11471247 - change date to 1998, PMID: 9787636, PMID: 10077608, PMID: 11230160, PMID: 18712303].  

<p>
<a href="server_papers/server_paper_example4.pl.html" target="_blank">Example 4</a> accesses the SEED server that offers access to the data we use to compute co-occurrence scores.  The program illustrates the potential for constructing custom tools by going through all of the protein-encoding genes in all of the complete prokaryotic genomes maintained within the SEED looking for "hypothetical proteins" that tend to co-occur with genes encoding functions that can be connected to subsystems.  The program constructs a table showing

<ol>
<li>the gene,</li>
<li>the function of the gene, 
<li>the genome id containing such a gene,</li>
<li>the description of the genome,</li>
<li>the non-hypothetical gene in a subsystem that appears to have the strongest co-occurrence score, </li>
<li>the co-occurrence score, and</li>
<li>the function assigned to the co-occurring gene contained in a subsystem.</li>
</ol>

We believe that there are many variations to this basic data mining capability that could be implemented on top of this basic co-occurrence data.


<h4>Example 4 Discussion</h4>
With the server packages installed as described in section C, the code in <a href="server_papers/server_paper_example4.pl.html" target="_blank">example 4</a> can be run as follows

<pre>
> perl server_paper_example4.pl 
</pre>

Line 10 retrieves all complete geneomes from the Sapling database
<pre>
10.	my $genomeHash = $sapObject->all_genomes(-complete => 1);
</pre>

<p>Lines 13 - 15 get all "Hypothetical" genes
<pre>
13.	    my $geneHash = $sapObject->feature_assignments(-genome => $genome,
14.	                                                   -type => 'peg');
15.	    my @hypotheticalGenes = grep { &SeedUtils::hypo($geneHash->{$_}) } sort keys %$geneHash
</pre>
<p>Lines 16 - 17  use the Functional Coupling Server to get a hash of all genes in the neighborhood of the hypothetical gene
<pre>
16.	    my $couplingHash = $coObject->conserved_in_neighborhood(-ids => \@hypotheticalGenes,
17.	                                                            -hash => 1);
</pre>

<p>
Lines 18 - 35 process the coupling data for each hypothetical gene and formats the output.
<pre>
18.	    for my $gene (@hypotheticalGenes) {
19.	        my $couplingList = $couplingHash->{$gene};
20.	        if (defined $couplingList) {
21.	            my $subHash = $sapObject->ids_to_subsystems(-ids => [ map { $_->[1]} @$couplingList ]);
22.	            my ($bestCoupler, $bestScore, $bestFunction) = (undef, 0, '');
23.	            for my $coupling (@$couplingList) {
24.	                my ($score, $coupler, $function) = @$coupling;
25.	                if ($subHash->{$coupler} && $score > $bestScore) {
26.	                    $bestCoupler = $coupler;
27.	                    $bestScore = $score;
28.	                    $bestFunction = $function;
29.	                }
30.	            }
31.	            if (defined $bestCoupler) {
32.	                print join("\t", $gene, $geneHash->{$gene}, $genome, $genomeName,
33.	                                 $bestCoupler, $bestScore, $bestFunction) . "\n";
34.	            }
35.	        }
</pre>

<h4>Example 4 output (truncated) </h4>




<pre>
fig|273035.4.peg.1016	hypothetical protein	273035.4	Spiroplasma kunkelii CR2-3x	fig|273035.4.peg.1018	31	DNA helicase
fig|273035.4.peg.1234	predicted metal-dependent hydrolase	273035.4	Spiroplasma kunkelii CR2-3x	fig|273035.4.peg.1233	80	Diacylglycerol kinase (EC 2.7.1.107)
fig|273035.4.peg.1496	hypothetical protein	273035.4	Spiroplasma kunkelii CR2-3x	fig|273035.4.peg.1491	9	Phage terminase large subunit
fig|273035.4.peg.1570	S1 RNA binding domain protein	273035.4	Spiroplasma kunkelii CR2-3x	fig|273035.4.peg.1571	10	Glucose-6-phosphate isomerase (EC 5.3.1.9)
fig|273035.4.peg.328	hypothetical protein	273035.4	Spiroplasma kunkelii CR2-3x	fig|273035.4.peg.329	60	Putative tRNA-m1A22 methylase
fig|273035.4.peg.403	hypothetical protein	273035.4	Spiroplasma kunkelii CR2-3x	fig|273035.4.peg.402	54	LSU ribosomal protein L27p
fig|273035.4.peg.60	hypothetical protein	273035.4	Spiroplasma kunkelii CR2-3x	fig|273035.4.peg.61	9	hypothetical protein
fig|273035.4.peg.600	hypothetical protein	273035.4	Spiroplasma kunkelii CR2-3x	fig|273035.4.peg.598	18	Recombination protein U homolog
fig|273035.4.peg.61	hypothetical protein	273035.4	Spiroplasma kunkelii CR2-3x	fig|273035.4.peg.60	9	hypothetical protein
fig|273035.4.peg.710	hypothetical protein	273035.4	Spiroplasma kunkelii CR2-3x	fig|273035.4.peg.708	12	LSU ribosomal protein L31p
.
.
.
</pre>

<h3><A NAME="xtocid2859214">D.5 Services to Support Annotation of Genes</A></h3>
<h4>D.5.a Identification of Genes </h4>
If one builds an annotation pipeline, one of the first steps involves identifying the putative genes.  <a href="server_papers/server_paper_example5.pl.html" target="_blank">Example 5</a> illustrates some basic functions that can be invoked via the servers to identify protein-encoding genes and rRNA-encoding genes, and tRNA-encoding genes.  These services utilize tools made available by JCVI, Niels Larsen, Gary Olsen, and Sean Eddy.  They offer reasonably accurate, easily-invoked services to locate genes in prokaryotic genomes.

<p>
<h4>D.5.b Assigning Functions to Encoded Proteins</h4>
Once genes have been identified, the next step usually relates to making initial estimates of function for the products of the protein-encoding genes.  <a href="server_papers/server_paper_example6.pl.html" target="_blank">Example 6</a> reads a fasta file of protein sequences and generates initial estimates of function.  There are two levels of service offered: the first is a very fast technique that can assign functions to most proteins that have been placed in FIGfams.  The second, much slower technique, involves invoking BLAT [PMID: 11932250] to seek an estimate of function against prokaryotic genes that have not been placed in FIGfams.  Both techniques are far more rapid than the use of BLAST [PMID: 2231712], but they are also miss some similarities BLAST detects.

<p>
<h4>D.5.c Creation of a Metabolic Reconstruction</h4>
<a href="server_papers/server_paper_example7.pl.html" target="_blank">Example 7</a> illustrates how to create an initial metabolic reconstruction from the assignments generated via a technique like those illustrated in example6.  It uses functionality already demonstrated in example2.  The program takes the detected functions, composes a set of functional roles (splitting multi-functional function assignments into the atomic functional roles), and accesses the inferred set of subsystems.

<h2><A NAME="xtocid2859215">E. A collection of basic functions using the server</A></h2>
We have developed a number of basic functions using the servers that can be run from the command line and piped together to create small systems. These should serve as models for others who wish to create their own custom bioinformatics systems using the servers.

<h3><A NAME="xtocid2859216">E.1 Find all features for a genome.</A></h3>
Simply retrieving all the features for a given genome is often the first step in an analysis sequence. This command is designed to be issued at the command line and takes as arguments a genome id and a feature type. The output is a single column table containing feature ids, suitable for piping into subsequent commands.

<p><i>svr_all_features</i> genome_id feature_type
<p>
This script returns all features of the specified type for the specified genome. Usage is <a href="API/svr_all_features.pl.html" target="_blank"> here</a>.
<p> The code for this server is <a href="server_papers/svr_all_features.pl.html" target="_blank"> here.</a> 

<p>
Here is an example of running this command:
<p>
<pre>
> svr_all_features 3702.1 peg
fig|3702.1.peg.1
fig|3702.1.peg.2
fig|3702.1.peg.3
fig|3702.1.peg.4
fig|3702.1.peg.5
fig|3702.1.peg.6
fig|3702.1.peg.7
fig|3702.1.peg.8
fig|3702.1.peg.9
fig|3702.1.peg.10
fig|3702.1.peg.11
fig|3702.1.peg.12
fig|3702.1.peg.13
fig|3702.1.peg.14
fig|3702.1.peg.15
.
.
.
</pre>


<h3><A NAME="xtocid2859217">E.2 Find Gene Function</A></h3>
Given a set of protein-encoding genes, a next step might be to retrieve the assigned function for each gene. This command takes as input a single column table of gene ids and returns a tab-separated two column table of gene id and function.

<p><i>svr_function_of</i> < table_of_gene_ids
<p>Note that this script uses stdin and stdout and is designed to be part of a processing pipeline.  Usage is <a href="API/svr_function_of.pl.html" target="_blank"> here</a>.
<p> The code for this server is <a href="server_papers/svr_function_of.pl.html" target="_blank"> here.</a> 
<p>
Here is an example of running this command:
<p>

<pre>
> svr_all_features 3702.1 peg | svr_function_of
fig|3702.1.peg.1	photosystem II protein D1 (PsbA)
fig|3702.1.peg.2	maturase
fig|3702.1.peg.3	SSU ribosomal protein S16p, chloroplast
fig|3702.1.peg.4	Photosystem II protein PsbK
fig|3702.1.peg.5	Photosystem II protein PsbI
fig|3702.1.peg.6	ATP synthase alpha chain (EC 3.6.3.14)
fig|3702.1.peg.7	ATP synthase CF0 B chain
fig|3702.1.peg.8	ATP synthase C chain (EC 3.6.3.14)
fig|3702.1.peg.9	ATP synthase CF0 A chain
fig|3702.1.peg.10	SSU ribosomal protein S2p (SAe), chloroplast
fig|3702.1.peg.11	DNA-directed RNA polymerase delta (= beta'') subunit (EC 2.7.7.6), chloroplast
fig|3702.1.peg.12	DNA-directed RNA polymerase gamma subunit (EC 2.7.7.6), chloroplast
fig|3702.1.peg.13	DNA-directed RNA polymerase beta subunit (EC 2.7.7.6), chloroplast
fig|3702.1.peg.14	Cytochrome b6-f complex subunit VIII (PetN)
fig|3702.1.peg.15	Photosystem II protein PsbM
.
.
.
</pre>
<h3><A NAME="xtocid2859218">E.3 Find Gene Aliases</A></h3>
Instead of function, perhaps you wish to see all the aliases by which a given feature or set of features is known in the SEED. You would use this command that behaves just like svr_function_of except it returns aliases:

<p><i>svr_aliases_of</i> < table_of_gene_ids
<p>Note that this script uses stdin and stdout and is designed to be part of a processing pipeline.  Usage is <a href="API/svr_aliases_of.pl.html" target="_blank"> here</a>.
<p> The code for this server is <a href="server_papers/svr_aliases_of.pl.html" target="_blank"> here.</a> 
<p>
Here is an example of running this command:
<p>

<pre>
> svr_all_features 3702.1 peg | svr_aliases_of
fig|3702.1.peg.1	gi|112382048,gi|113200888,gi|114054364,gi|114107113,gi|114329726,gi|115531894,gi|134286292,gi|134286378,
gi|134286553,gi|134286643,gi|134286733,gi|134286999,gi|139387232,gi|139389076,gi|139389398,gi|139389623,gi|139389781,gi|13938993
1,gi|156597939,gi|156598592,gi|157695865,gi|159792928,gi|159793098,gi|159895452,gi|159895537,gi|166344112,gi|167391785,gi|169142
690,gi|169142840,gi|169142925,gi|169143011,gi|169794053,gi|6723714,sp|A4QJR4,sp|A4QJZ9,sp|A4QKH2,sp|A4QKR1,sp|A4QL00,sp|A4QLR3,s
p|B0Z4K6,sp|B0Z4U0,sp|B0Z524,sp|B0Z5A8,sp|B1A915,sp|B1NWD0,sp|P83755,sp|P83755,sp|P83756,sp|P83756,sp|Q06FY1,sp|Q09G66,sp|Q0G9Y2
,tr|A4QJZ9,tr|A4QKH2,tr|A4QL00,tr|A4QLR3,tr|A9QAZ4,tr|A9QAZ4,tr|A9QBW0,tr|A9QBW0,tr|Q06FY1,tr|Q09G66,tr|Q0G9Y2,fig|3702.1.peg.1,
gi|515374,gi|5881674,gi|7525013,fig|85636.1.peg.1,gi|13518299
fig|3702.1.peg.2	gi|12002371,gi|12002415,gi|12002417,gi|12002419,gi|12002421,gi|12002423,gi|12002425,gi|12002427,gi|12002
429,gi|12002431,gi|126022795,gi|5881675
fig|3702.1.peg.3	sp|P56806,sp|P56806,gi|5881676,gi|7525015
fig|3702.1.peg.4	sp|P56782,sp|P56782,fig|3702.1.peg.4,gi|5881677,gi|7525016
.
.
.

</pre>

<h3><A NAME="xtocid2859219">E.4 Find Neighbors</A></h3>
Beyond the basics of finding aliases and function, a more advanced analysis might require finding the PEGs that are in the neighborhood of a given PEG. This command takes as input a tab-separated table where the last field in each line contains the PEG for which a list of neighbors is being requested. It takes an argument telling how many neighbors to find to the left and right. The output file is the input file with an extra column appended at the end (containing a list of neighbors).

<p><i>svr_neighbors_of n</i> < table_of_gene_ids
<p>Note that this script uses stdin and stdout and is designed to be part of a processing pipeline.  Usage is <a href="API/svr_neighbors_of.pl.html" target="_blank"> here</a>.
<p> The code for this server is <a href="server_papers/svr_neighbors_of.pl.html" target="_blank"> here.</a> 
<p>
Here is an example of running this command:
<p>

<pre>
> svr_all_features 3702.1 peg | svr_neighbors_of 5
fig|3702.1.peg.1	fig|3702.1.peg.2,fig|3702.1.peg.3,fig|3702.1.peg.4,fig|3702.1.peg.5,fig|3702.1.peg.6
fig|3702.1.peg.2	fig|3702.1.peg.1,fig|3702.1.peg.3,fig|3702.1.peg.4,fig|3702.1.peg.5,fig|3702.1.peg.6,fig|3702.1.peg.7
fig|3702.1.peg.3	fig|3702.1.peg.1,fig|3702.1.peg.2,fig|3702.1.peg.4,fig|3702.1.peg.5,fig|3702.1.peg.6,fig|3702.1.peg.7,fi
g|3702.1.peg.8
fig|3702.1.peg.4	fig|3702.1.peg.1,fig|3702.1.peg.2,fig|3702.1.peg.3,fig|3702.1.peg.5,fig|3702.1.peg.6,fig|3702.1.peg.7,fi
g|3702.1.peg.8,fig|3702.1.peg.9
fig|3702.1.peg.5	fig|3702.1.peg.1,fig|3702.1.peg.2,fig|3702.1.peg.3,fig|3702.1.peg.4,fig|3702.1.peg.6,fig|3702.1.peg.7,fi
g|3702.1.peg.8,fig|3702.1.peg.9,fig|3702.1.peg.10
.
.
.
</pre>

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3