[Bio] / FigKernelPackages / SAPtutorial.pm Repository:
ViewVC logotype

Annotation of /FigKernelPackages/SAPtutorial.pm

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1.3 - (view) (download) (as text)

1 : parrello 1.1 #!/usr/bin/perl -w
2 :     use strict;
3 :    
4 :     =head1 Using the Sapling Server -- A Tutorial
5 :    
6 :     =head2 What Is the Sapling Server?
7 :    
8 :     The B<Sapling Server> is a web service that allows you to access data stored in
9 :     the Sapling database, a large-scale MySQL database of genetic data. The Sapling
10 :     database contains data on public genomes imported from L<RAST|http://rast.nmpdr.org>
11 :     and curated by the annotation team of the Fellowship for Interpretation of
12 :     Genomes.
13 :    
14 :     The L<SAPserver> package wraps calls to the Sapling Server so that you can use
15 :     them in your PERL program.
16 :    
17 :     All Sapling Server services are documented in the L<SAP> module. Each method has
18 :     a signature like
19 :    
20 :     my $document = $sapObject->taxonomy_of($args);
21 :    
22 :     where C<$document> is usually a hash reference and C<$args> is B<always> a hash
23 :     reference. The method description includes a section called I<Parameter Hash
24 :     Fields> that describes the fields in C<$args>. For example, L<SAP/taxonomy_of>
25 :     has a field called C<ids> that is to be a list of genome IDs and an optional
26 :     field called C<format> that indicates whether you want taxonomy groups
27 :     represented by numbers, names, or both. To call the I<taxonomy_of> service,
28 :     you create a B<SAPserver> object and call a method with the same name as the
29 :     service.
30 :    
31 :     use SAPserver;
32 :    
33 :     my $sapServer = SAPserver->new();
34 :     my $resultHash = $sapServer->taxonomy_of({ ids => ['360108.3', '100226.1'],
35 :     format => 'names' });
36 :     for my $id (keys %$resultHash) {
37 :     my $taxonomy = $resultHash->{$id};
38 :     print $id, join(" ", @$taxonomy);
39 :     }
40 :    
41 :     For convenience, you can specify the parameters as a simple hash rather
42 :     than a hash reference. So, for example, the above I<taxonomy_of> call could
43 :     also be written like this.
44 :    
45 :     my $resultHash = $sapServer->taxonomy_of(ids => ['360108.3', '100226.1'],
46 :     format => 'names');
47 :    
48 :     =head2 A Simple Example: Genome Taxonomies
49 :    
50 :     Let's look at a simple program that pulls all the complete genomes from the
51 :     database and displays their representative genomes plus their texonomies in name
52 :     format.
53 :    
54 :     Three Sapling Server methods are needed to perform this function.
55 :    
56 :     =over 4
57 :    
58 :     =item *
59 :    
60 :     L<SAP/all_genomes> to get the list of genome IDs.
61 :    
62 :     =item *
63 :    
64 :     L<SAP/taxonomy_of> to get the genome taxonomies.
65 :    
66 :     =item *
67 :    
68 :     L<SAP/representative> to get the representative genome IDs.
69 :    
70 :     =back
71 :    
72 :     The program starts by connecting to the Sapling Server itself.
73 :    
74 :     use strict;
75 :     use SAPserver;
76 :    
77 :     my $sapServer = SAPserver->new();
78 :    
79 :     Now we use I<all_genomes> to get a list of the IDs for complete genomes.
80 :     I<all_genomes> will normally return B<all> genome IDs, but we use the
81 :     C<complete> option to restrict the output to complete genomes.
82 :    
83 :     my $genomeIDs = $sapServer->all_genomes(complete => 1);
84 :    
85 :     All we want are the genome IDs, so we use a PERL trick to convert the
86 :     hash reference to a list reference.
87 :    
88 :     $genomeIDs = [ keys %$genomeIDs ];
89 :    
90 :     Now we ask for the representatives and the taxonomies.
91 :    
92 :     my $representativeHash = $sapServer->representative(ids => $genomeIDs);
93 :     my $taxonomyHash = $sapServer->taxonomy_of(ids => $genomeIDs,
94 :     format => 'names');
95 :    
96 :     Our data is now contained in a pair of hash tables. The following loop
97 :     stiches them together to produce the output.
98 :    
99 :     for my $genomeID (@$genomeIDs) {
100 :     my $repID = $representativeHash->{$genomeID};
101 :     my $taxonomy = $taxonomyHash->{$genomeID};
102 :     # Format the taxonomy string.
103 :     my $taxonomyString = join(" ", @$taxonomy);
104 :     # Print the result.
105 :     print join("\t", $genomeID, $repID, $taxonomyString) . "\n";
106 :     }
107 :    
108 :     The Sapling Server processes lists of data (in this case a list of genome IDs)
109 :     so that you can minimize the overhead of sending requests across the web. You
110 :     can, however, specify a single ID instead of a list to a method call, and
111 :     this would allow you to structure your program with a more traditional loop, as
112 :     follows. To make this process simpler, you construct the Sapling Server object
113 :     in I<singleton mode>. In singleton mode, when you pass in only a single ID,
114 :     you get back an actual result instead of a hash reference.
115 :    
116 :     use strict;
117 :     use SAPserver;
118 :    
119 :     my $sapServer = SAPserver->new(singleton => 1);
120 :     my $genomeIDs = $sapServer->all_genomes(complete => 1);
121 :     $genomeIDs = [ keys %$genomeIDs ];
122 :     for my $genomeID (@$genomeIDs) {
123 :     my $repID = $sapServer->representative(ids => $genomeID);
124 :     my $taxonomy = $sapServer->taxonomy_of(ids => $genomeID,
125 :     format => 'names');
126 :     # Format the taxonomy string.
127 :     my $taxonomyString = join(" ", @$taxonomy);
128 :     # Print the result.
129 :     print join("\t", $genomeID, $repID, $taxonomyString) . "\n";
130 :     }
131 :    
132 :     At this point there is a risk that you are bewildered by all the options we've
133 :     presented-- hashes, hash references, singleton mode-- however, the goal here is
134 :     to provide a facility that will fit comfortably with different programming
135 :     styles. The server software tries to figure out how you want to use it and
136 :     adjusts accordingly.
137 :    
138 :     =head2 Specifying Gene IDs
139 :    
140 :     Many of the Sapling Server services return data on genes (a term we use rather
141 :     loosely to include any kind of genetic I<locus> or C<feature>). The standard
142 :     method for identifying a gene is the I<FIG ID>, an identifying string that
143 :     begins with the characters C<fig|> and includes the genome ID, the gene type,
144 :     and an additional number for uniqueness. For example, the FIG ID
145 :     C<fig|272558.1.peg.203> describes a protein encoding gene (I<peg>) in
146 :     Bacillus halodurans C-125 (I<272558.1>).
147 :    
148 :     Frequently, however, you will have a list of gene IDs from some other
149 :     database (e.g. L<NCBI|http://www.ncbi.nlm.nih.gov>, L<UniProt|http://www.uniprot.org>)
150 :     or in a community format such as Locus Tags or gene names. Most services that
151 :     take gene IDs as input allow you to specify a C<source> option that indicates
152 :     the type of IDs being used. The acceptable formats are as follows.
153 :    
154 :     =over 4
155 :    
156 :     =item CMR
157 :    
158 :     L<Comprehensive Microbial Resource ID|http://cmr.jcvi.org>. The CMR IDs for
159 :     C<fig|272558.1.peg.203> are C<10172815> and C<NTL01BH0204>.
160 :    
161 :     =item GENE
162 :    
163 :     Common Gene name. Often, these correspond to a large number of specified genes.
164 :     For example, C<accD>, which identifies the Acetyl-coenzyme A carboxyl
165 :     transferase beta chain, corresponds to 58 specific genes in the database.
166 :    
167 :     =item GeneID
168 :    
169 :     Common gene number. The common gene number for C<fig|272558.1.peg.203> is
170 :     C<891745>.
171 :    
172 :     =item KEGG
173 :    
174 :     L<Kyoto Encyclopedia of Genes and Genomes|http://www.genome.jp/kegg> identifier.
175 :     For example, the KEGG identifier for C<fig|158878.1.peg.2821> is C<sav:SAV2628>.
176 :    
177 :     =item LocusTag
178 :    
179 :     Common locus tag. For example, the locus tag of C<fig|380703.5.peg.250> is
180 :     C<ABK38410.1>.
181 :    
182 :     =item NCBI
183 :    
184 :     L<NCBI|http://www.ncbi.nlm.nih.gov> accession number. The accession numbers for
185 :     C<fig|272558.1.peg.203> are C<10172815>, C<15612766>, and C<81788207>.
186 :    
187 :     =item RefSeq
188 :    
189 :     L<NCBI|http://www.ncbi.nlm.nih.gov> reference sequence identifier. The RefSeq
190 :     identifier for C<fig|272558.1.peg.203> is C<NP_241069.1>.
191 :    
192 :     =item SEED
193 :    
194 :     FIG identifier. This is the default option.
195 :    
196 :     =item SwissProt
197 :    
198 :     L<SwissProt|http://www.expasy.ch/sprot> identifier. For example, the SwissProt
199 :     identifier for C<fig|243277.1.peg.3952> is C<O31153>.
200 :    
201 :     =item UniProt
202 :    
203 :     L<UniProt|http://www.uniprot.org> identifier. The UniProt identifiers for
204 :     C<fig|272558.1.peg.203> are C<Q9KGA9> and C<Q9KGA9_BACHD>.
205 :    
206 :     =back
207 :    
208 :     You can also mix identifiers of different types by specifying C<mixed>
209 :     for the source type. In this case, however, care must be taken, because the same
210 :     identifier can have different meanings in different databases.
211 :    
212 :     =head2 Normal Mode Examples
213 :    
214 :     The following examples use the Sapling Server in normal mode: that is, data
215 :     is sent to the server in batches and the results are stitched together
216 :     afterward. In this mode there is significantly reduced overhead, but there is
217 :     also a risk that the server request might time out. If this happens, you may
218 :     want to consider breaking the input into smaller batches. At some point, the
219 :     server system will perform sophisticated flow control to reduce the risk of
220 :     timeout errors, but we are not yet at that point.
221 :    
222 :     =head3 Retrieving Functional Roles
223 :    
224 :     There are two primary methods for retrieving functional roles.
225 :    
226 :     =over 4
227 :    
228 :     =item *
229 :    
230 :     L<SAP/ids_to_functions> returns the current functional assignment of a gene.
231 :    
232 :     =item *
233 :    
234 :     L<SAP/ids_to_subsystems> returns the subsystems and roles of a gene.
235 :    
236 :     =back
237 :    
238 :     In both cases, the list of incoming gene IDs is given as a list via the C<ids>
239 :     parameter. It is assumed the IDs are FIG identifiers, but the C<source> parameter
240 :     can be used to specify a different ID type (see L</Specifying Gene IDs>).
241 :    
242 :     B<ids_to_functions> provides the basic functional role. Almost every gene in the
243 :     system will return a result with this method. The following example program
244 :     reads a file of UniProt IDs and produces their functional roles. Note that
245 :     we're assuming the file is a manageable size: since we're submitting the entire
246 :     file at once, we risk a timeout error if it's too big.
247 :    
248 :     use strict;
249 :     use SAPserver;
250 :    
251 :     my $sapServer = SAPserver->new();
252 :     # Read all the gene IDs.
253 :     my @genes = map { chomp; $_ } <STDIN>;
254 :     # Compute the functional roles.
255 :     my $results = $sapServer->ids_to_functions(ids => \@genes, source => 'UniProt');
256 :     # Loop through the genes.
257 :     for my $gene (@genes) {
258 :     # Did we find a result?
259 :     my $role = $results->{$gene};
260 :     if (defined $role) {
261 :     # Yes, print it.
262 :     print "$gene\t$role\n";
263 :     } else {
264 :     # No, emit a warning.
265 :     print STDERR "$gene was not found.\n";
266 :     }
267 :     }
268 :    
269 :     B<ids_to_subsystems> returns roles in subsystems. Roles in subsystems have
270 :     several differences from general functional roles. A single gene may be in
271 :     multiple subsystems and may have multiple roles in a subsystem. In addition,
272 :     only half of the genes in the database are currently associated with subsystems.
273 :     As a result, instead of a single string per incoming gene, B<ids_to_subsystems>
274 :     returns a list. Each element of the list consists of the role name followed by
275 :     the subsystem name. This makes the processing of the results a little more
276 :     complex, because we have to iterate through the list. An empty list indicates
277 :     the gene is not in a subsystem (although it could also indicate the gene was
278 :     not found).
279 :    
280 :     use SAPserver;
281 :    
282 :     my $sapServer = SAPserver->new();
283 :     # Read all the gene IDs.
284 :     my @genes = map { chomp; $_ } <STDIN>;
285 :     # Compute the functional roles.
286 :     my $results = $sapServer->ids_to_subsystems(ids => \@genes, source => 'UniProt');
287 :     # Loop through the genes.
288 :     for my $gene (@genes) {
289 :     # Did we find a result?
290 :     my $roleData = $results->{$gene};
291 :     if (! @$roleData) {
292 :     # Not in a subsystem: emit a warning.
293 :     print STDERR "$gene is not in a subsystem.\n";
294 :     } else {
295 :     # Yes, print the entries.
296 :     for my $ssData (@$roleData) {
297 :     print "$gene\t$ssData->[0]\t($ssData->[1])\n"
298 :     }
299 :     }
300 :     }
301 :    
302 : parrello 1.2 =head3 Genes in Subsystems for Genomes
303 : parrello 1.1
304 : parrello 1.2 This next example finds all genes in subsystems for a specific genome. We will
305 :     perform this operation in two phases. First, we will find the subsystems for
306 :     each genome, and then the genes for those subsystems. This requires two Sapling
307 :     Server functions.
308 : parrello 1.1
309 : parrello 1.2 =over 4
310 :    
311 :     =item *
312 :    
313 :     L<SAP/genomes_to_subsystems> returns a list of the subsystems for each
314 :     specified genome.
315 :    
316 :     =item *
317 :    
318 :     L<SAP/ids_in_subsystems> returns a list of the genes in each listed
319 :     subsystem for a specified genome.
320 :    
321 :     =back
322 :    
323 :     Our example program will loop through the genome IDs in an input file. For
324 :     each genome, it will call I<genomes_to_subsystems> to get the subsystem list,
325 :     and then feed the list to I<ids_in_subsystems> to get the result.
326 :    
327 :     L<SAP/ids_in_subsystems> returns gene IDs rather than taking them as input.
328 :     Like L<SAP/ids_to_subsystems> and L<SAP/ids_to_functions>, it takes a C<source>
329 :     parameter that indicates the type of ID desired (e.g. C<NCBI>, C<CMR>, C<LocusTag>).
330 :     In this case, however, the type describes how the gene IDs will be formatted in
331 :     the output rather than the type expected upon input. If a gene does not have an
332 :     ID for a particular source type (e.g. C<fig|100226.1.peg.3361> does not have a
333 :     locus tag), then the FIG identifier is used. The default source type (C<SEED>)
334 :     means that FIG identifiers will be used for everything.
335 :    
336 :     The program is given below.
337 :    
338 :     use strict;
339 :     use SAPserver;
340 :    
341 :     my $sapServer = SAPserver->new();
342 :     # Loop through the input file. Note that this loop will stop on the first
343 :     # blank line.
344 :     while (my $genomeID = <STDIN>) {
345 :     chomp $genomeID;
346 :     # Get the subsystems for this genome.
347 :     my $subHash = $sapServer->genomes_to_subsystems(ids => $genomeID);
348 :     # The data returned for each genome (and in our case there's only one)
349 :     # includes the subsystem name and the variant code. The following
350 :     # statement strips away the variant codes, leaving only the subsystem
351 :     # names.
352 :     my $subList = [ map { $_->[0] } @{$subHash->{$genomeID}} ];
353 :     # Ask for the genes in each subsystem, using NCBI identifiers.
354 :     my $roleHashes = $sapServer->ids_in_subsystems(subsystems => $subList,
355 :     genome => $genomeID,
356 :     source => 'NCBI',
357 :     roleForm => 'full');
358 :     # The hash maps each subsystem ID to a hash of roles to lists of feature
359 :     # IDs. We therefore use three levels of nested loops to produce our
360 :     # output lines. At the top level we have a hash mapping subsystem IDs
361 :     # to role hashes.
362 :     for my $subsystem (sort keys %$roleHashes) {
363 :     my $geneHash = $roleHashes->{$subsystem};
364 :     # The gene hash maps each role to a list of gene IDs.
365 :     for my $role (sort keys %$geneHash) {
366 :     my $geneList = $geneHash->{$role};
367 :     # Finally, we loop through the gene IDs.
368 :     for my $gene (@$geneList) {
369 :     print "$genomeID\t$subsystem\t$role\t$gene\n";
370 :     }
371 :     }
372 :     }
373 :     }
374 : parrello 1.1
375 : parrello 1.3 The I<ids_in_subsystems> service has several useful options for changing the nature
376 :     of the output. For example, in the above program each role is represented by a
377 :     full description (C<roleForm> set to C<full>). If you don't need the roles, you
378 :     can specify C<none> for the role form. You can also request that the gene IDs
379 :     be returned in a comma-separated list instead of a list data structure. These
380 :     two changes can drastically simplify the above program.
381 :    
382 :     use strict;
383 :     use SAPserver;
384 :    
385 :     my $sapServer = SAPserver->new();
386 :     # Loop through the input file. Note that this loop will stop on the first
387 :     # blank line.
388 :     while (my $genomeID = <STDIN>) {
389 :     chomp $genomeID;
390 :     # Get the subsystems for this genome.
391 :     my $subHash = $sapServer->genomes_to_subsystems(ids => $genomeID);
392 :     # The data returned for each genome (and in our case there's only one)
393 :     # includes the subsystem name and the variant code. The following
394 :     # statement strips away the variant codes, leaving only the subsystem
395 :     # names.
396 :     my $subList = [ map { $_->[0] } @{$subHash->{$genomeID}} ];
397 :     # Ask for the genes in each subsystem, using NCBI identifiers.
398 :     my $genesHash = $sapServer->ids_in_subsystems(subsystems => $subList,
399 :     genome => $genomeID,
400 :     source => 'NCBI',
401 :     roleForm => 'none',
402 :     grouped => 1);
403 :     # The hash maps each subsystem ID to a comma-delimited list of gene IDs.
404 :     for my $subsystem (sort keys %$genesHash) {
405 :     my $genes = $genesHash->{$subsystem};
406 :     print "$genomeID\t$subsystem\t$genes\n";
407 :     }
408 :     }
409 :    
410 : parrello 1.1 =cut
411 :    
412 :     1;

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3