[Bio] / FigKernelPackages / SAPtutorial.pm Repository:
ViewVC logotype

Annotation of /FigKernelPackages/SAPtutorial.pm

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1.9 - (view) (download) (as text)

1 : parrello 1.1 #!/usr/bin/perl -w
2 :     use strict;
3 :    
4 :     =head1 Using the Sapling Server -- A Tutorial
5 :    
6 :     =head2 What Is the Sapling Server?
7 :    
8 :     The B<Sapling Server> is a web service that allows you to access data stored in
9 :     the Sapling database, a large-scale MySQL database of genetic data. The Sapling
10 : parrello 1.5 database contains data on public genomes imported from I<RAST> (L<http://rast.nmpdr.org>)
11 : parrello 1.1 and curated by the annotation team of the Fellowship for Interpretation of
12 :     Genomes.
13 :    
14 :     The L<SAPserver> package wraps calls to the Sapling Server so that you can use
15 :     them in your PERL program.
16 :    
17 :     All Sapling Server services are documented in the L<SAP> module. Each method has
18 :     a signature like
19 :    
20 :     my $document = $sapObject->taxonomy_of($args);
21 :    
22 :     where C<$document> is usually a hash reference and C<$args> is B<always> a hash
23 :     reference. The method description includes a section called I<Parameter Hash
24 :     Fields> that describes the fields in C<$args>. For example, L<SAP/taxonomy_of>
25 : parrello 1.8 has a field called C<-ids> that is to be a list of genome IDs and an optional
26 :     field called C<-format> that indicates whether you want taxonomy groups
27 : parrello 1.1 represented by numbers, names, or both. To call the I<taxonomy_of> service,
28 :     you create a B<SAPserver> object and call a method with the same name as the
29 :     service.
30 :    
31 : parrello 1.4 use strict;
32 : parrello 1.1 use SAPserver;
33 :    
34 :     my $sapServer = SAPserver->new();
35 : parrello 1.7 my $resultHash = $sapServer->taxonomy_of({ -ids => ['360108.3', '100226.1'],
36 :     -format => 'names' });
37 : parrello 1.1 for my $id (keys %$resultHash) {
38 :     my $taxonomy = $resultHash->{$id};
39 : parrello 1.4 print "$id: " . join(" ", @$taxonomy) . "\n";
40 : parrello 1.1 }
41 :    
42 : parrello 1.4 The output from this program (reformatted slightly for readability) is shown below.
43 :    
44 :     360108.3: Bacteria, Proteobacteria, delta/epsilon subdivisions, Epsilonproteobacteria,
45 :     Campylobacterales, Campylobacteraceae, Campylobacter, Campylobacter jejuni,
46 :     Campylobacter jejuni subsp. jejuni, Campylobacter jejuni subsp. jejuni 260.94
47 :    
48 :     100226.1: Bacteria, Actinobacteria, Actinobacteria (class), Actinobacteridae,
49 :     Actinomycetales, Streptomycineae, Streptomycetaceae, Streptomyces,
50 :     Streptomyces coelicolor, Streptomyces coelicolor A3(2)
51 :    
52 : parrello 1.1 For convenience, you can specify the parameters as a simple hash rather
53 :     than a hash reference. So, for example, the above I<taxonomy_of> call could
54 :     also be written like this.
55 :    
56 : parrello 1.7 my $resultHash = $sapServer->taxonomy_of(-ids => ['360108.3', '100226.1'],
57 :     -format => 'names');
58 : parrello 1.1
59 :     =head2 A Simple Example: Genome Taxonomies
60 :    
61 :     Let's look at a simple program that pulls all the complete genomes from the
62 :     database and displays their representative genomes plus their texonomies in name
63 :     format.
64 :    
65 :     Three Sapling Server methods are needed to perform this function.
66 :    
67 :     =over 4
68 :    
69 :     =item *
70 :    
71 :     L<SAP/all_genomes> to get the list of genome IDs.
72 :    
73 :     =item *
74 :    
75 :     L<SAP/taxonomy_of> to get the genome taxonomies.
76 :    
77 :     =item *
78 :    
79 :     L<SAP/representative> to get the representative genome IDs.
80 :    
81 :     =back
82 :    
83 :     The program starts by connecting to the Sapling Server itself.
84 :    
85 :     use strict;
86 :     use SAPserver;
87 :    
88 :     my $sapServer = SAPserver->new();
89 :    
90 : parrello 1.8 Now we use I<all_genomes> to get a list of the complete genomes.
91 :     I<all_genomes> will normally return B<all> genomes, but we use the
92 :     C<-complete> option to restrict the output to those that are complete.
93 : parrello 1.1
94 : parrello 1.8 my $genomeIDs = $sapServer->all_genomes(-complete => 1);
95 : parrello 1.1
96 :     All we want are the genome IDs, so we use a PERL trick to convert the
97 : parrello 1.8 hash reference in C<$genomeIDs> to a list reference.
98 : parrello 1.1
99 :     $genomeIDs = [ keys %$genomeIDs ];
100 :    
101 :     Now we ask for the representatives and the taxonomies.
102 :    
103 : parrello 1.7 my $representativeHash = $sapServer->representative(-ids => $genomeIDs);
104 :     my $taxonomyHash = $sapServer->taxonomy_of(-ids => $genomeIDs,
105 :     -format => 'names');
106 : parrello 1.1
107 :     Our data is now contained in a pair of hash tables. The following loop
108 :     stiches them together to produce the output.
109 :    
110 :     for my $genomeID (@$genomeIDs) {
111 :     my $repID = $representativeHash->{$genomeID};
112 :     my $taxonomy = $taxonomyHash->{$genomeID};
113 :     # Format the taxonomy string.
114 :     my $taxonomyString = join(" ", @$taxonomy);
115 :     # Print the result.
116 :     print join("\t", $genomeID, $repID, $taxonomyString) . "\n";
117 :     }
118 :    
119 : parrello 1.4 An excerpt from the output of this script is shown below. The first column contains
120 :     a genome ID, the second contains the representative genome's ID, and the third is
121 :     the full taxonomy. Note that the two genomes with very close taxonomies have the
122 : parrello 1.8 same representative genome: this is the expected behavior.
123 : parrello 1.4
124 :     221109.1 221109.1 Bacteria Firmicutes Bacilli Bacillales Bacillaceae Oceanobacillus Oceanobacillus iheyensis Oceanobacillus iheyensis HTE831
125 :     204722.1 204722.1 Bacteria Proteobacteria Alphaproteobacteria Rhizobiales Brucellaceae Brucella Brucella suis Brucella suis 1330
126 :     391037.3 369723.3 Bacteria Actinobacteria Actinobacteria (class) Actinobacteridae Actinomycetales Micromonosporineae Micromonosporaceae Salinispora Salinispora arenicola Salinispora arenicola CNS205
127 :     339670.3 216591.1 Bacteria Proteobacteria Betaproteobacteria Burkholderiales Burkholderiaceae Burkholderia Burkholderia cepacia complex Burkholderia cepacia Burkholderia cepacia AMMD
128 :     272560.3 216591.1 Bacteria Proteobacteria Betaproteobacteria Burkholderiales Burkholderiaceae Burkholderia pseudomallei group Burkholderia pseudomallei Burkholderia pseudomallei K96243
129 :     262768.1 262768.1 Bacteria Firmicutes Mollicutes Acholeplasmatales Acholeplasmataceae Candidatus Phytoplasma Candidatus Phytoplasma asteris Onion yellows phytoplasma Onion yellows phytoplasma OY-M
130 :     272624.3 272624.3 Bacteria Proteobacteria Gammaproteobacteria Legionellales Legionellaceae Legionella Legionella pneumophila Legionella pneumophila subsp. pneumophila Legionella pneumophila subsp. pneumophila str. Philadelphia 1
131 :     150340.3 150340.3 Bacteria Proteobacteria Gammaproteobacteria Vibrionales Vibrionaceae Vibrio Vibrio sp. Ex25
132 :     205914.1 205914.1 Bacteria Proteobacteria Gammaproteobacteria Pasteurellales Pasteurellaceae Histophilus Histophilus somni Haemophilus somnus 129PT
133 :     393117.3 169963.1 Bacteria Firmicutes Bacilli Bacillales Listeriaceae Listeria Listeria monocytogenes Listeria monocytogenes FSL J1-194
134 :     203119.1 203119.1 Bacteria Firmicutes Clostridia Clostridiales Clostridiaceae Clostridium Clostridium thermocellum Clostridium thermocellum ATCC 27405
135 :     380703.5 380703.5 Bacteria Proteobacteria Gammaproteobacteria Aeromonadales Aeromonadaceae Aeromonas Aeromonas hydrophila Aeromonas hydrophila subsp. hydrophila Aeromonas hydrophila subsp. hydrophila ATCC 7966
136 :     259536.4 259536.4 Bacteria Proteobacteria Gammaproteobacteria Pseudomonadales Moraxellaceae Psychrobacter Psychrobacter arcticus Psychrobacter arcticus 273-4
137 :    
138 : parrello 1.1 The Sapling Server processes lists of data (in this case a list of genome IDs)
139 :     so that you can minimize the overhead of sending requests across the web. You
140 :     can, however, specify a single ID instead of a list to a method call, and
141 :     this would allow you to structure your program with a more traditional loop, as
142 :     follows. To make this process simpler, you construct the Sapling Server object
143 :     in I<singleton mode>. In singleton mode, when you pass in only a single ID,
144 :     you get back an actual result instead of a hash reference.
145 :    
146 :     use strict;
147 :     use SAPserver;
148 :    
149 :     my $sapServer = SAPserver->new(singleton => 1);
150 : parrello 1.7 my $genomeIDs = $sapServer->all_genomes(-complete => 1);
151 : parrello 1.1 $genomeIDs = [ keys %$genomeIDs ];
152 :     for my $genomeID (@$genomeIDs) {
153 : parrello 1.7 my $repID = $sapServer->representative(-ids => $genomeID);
154 :     my $taxonomy = $sapServer->taxonomy_of(-ids => $genomeID,
155 :     -format => 'names');
156 : parrello 1.1 # Format the taxonomy string.
157 :     my $taxonomyString = join(" ", @$taxonomy);
158 :     # Print the result.
159 :     print join("\t", $genomeID, $repID, $taxonomyString) . "\n";
160 :     }
161 :    
162 :     At this point there is a risk that you are bewildered by all the options we've
163 :     presented-- hashes, hash references, singleton mode-- however, the goal here is
164 :     to provide a facility that will fit comfortably with different programming
165 :     styles. The server software tries to figure out how you want to use it and
166 :     adjusts accordingly.
167 :    
168 :     =head2 Specifying Gene IDs
169 :    
170 :     Many of the Sapling Server services return data on genes (a term we use rather
171 : parrello 1.8 loosely to include any kind of genetic I<locus> or I<feature>). The standard
172 : parrello 1.1 method for identifying a gene is the I<FIG ID>, an identifying string that
173 :     begins with the characters C<fig|> and includes the genome ID, the gene type,
174 :     and an additional number for uniqueness. For example, the FIG ID
175 :     C<fig|272558.1.peg.203> describes a protein encoding gene (I<peg>) in
176 :     Bacillus halodurans C-125 (I<272558.1>).
177 :    
178 :     Frequently, however, you will have a list of gene IDs from some other
179 : parrello 1.8 database (e.g. I<NCBI>, I<UniProt>) or in a community format such as Locus Tags
180 :     or gene names. Most services that take gene IDs as input allow you to specify a
181 :     C<-source> option that indicates the type of IDs being used. The acceptable
182 :     formats are as follows.
183 : parrello 1.1
184 :     =over 4
185 :    
186 :     =item CMR
187 :    
188 : parrello 1.5 I<Comprehensive Microbial Resource> (L<http://cmr.jcvi.org>) ID. The CMR IDs for
189 : parrello 1.1 C<fig|272558.1.peg.203> are C<10172815> and C<NTL01BH0204>.
190 :    
191 :     =item GENE
192 :    
193 :     Common Gene name. Often, these correspond to a large number of specified genes.
194 :     For example, C<accD>, which identifies the Acetyl-coenzyme A carboxyl
195 :     transferase beta chain, corresponds to 58 specific genes in the database.
196 :    
197 :     =item GeneID
198 :    
199 :     Common gene number. The common gene number for C<fig|272558.1.peg.203> is
200 :     C<891745>.
201 :    
202 :     =item KEGG
203 :    
204 : parrello 1.5 I<Kyoto Encyclopedia of Genes and Genomes> (L<http://www.genome.jp/kegg>) identifier.
205 : parrello 1.1 For example, the KEGG identifier for C<fig|158878.1.peg.2821> is C<sav:SAV2628>.
206 :    
207 :     =item LocusTag
208 :    
209 :     Common locus tag. For example, the locus tag of C<fig|380703.5.peg.250> is
210 :     C<ABK38410.1>.
211 :    
212 :     =item NCBI
213 :    
214 : parrello 1.5 I<NCBI> (L<http://www.ncbi.nlm.nih.gov>) accession number. The accession numbers for
215 : parrello 1.1 C<fig|272558.1.peg.203> are C<10172815>, C<15612766>, and C<81788207>.
216 :    
217 :     =item RefSeq
218 :    
219 : parrello 1.5 I<NCBI> (L<http://www.ncbi.nlm.nih.gov>) reference sequence identifier. The RefSeq
220 : parrello 1.1 identifier for C<fig|272558.1.peg.203> is C<NP_241069.1>.
221 :    
222 :     =item SEED
223 :    
224 :     FIG identifier. This is the default option.
225 :    
226 :     =item SwissProt
227 :    
228 : parrello 1.5 I<SwissProt> (L<http://www.expasy.ch/sprot>) identifier. For example, the SwissProt
229 : parrello 1.1 identifier for C<fig|243277.1.peg.3952> is C<O31153>.
230 :    
231 :     =item UniProt
232 :    
233 : parrello 1.5 I<UniProt> (L<http://www.uniprot.org>) identifier. The UniProt identifiers for
234 : parrello 1.1 C<fig|272558.1.peg.203> are C<Q9KGA9> and C<Q9KGA9_BACHD>.
235 :    
236 :     =back
237 :    
238 :     You can also mix identifiers of different types by specifying C<mixed>
239 :     for the source type. In this case, however, care must be taken, because the same
240 : parrello 1.9 identifier can have different meanings in different databases. You can also
241 :     specify C<prefixed> to use IDs with type prefixes. For RefSeq and SEED identifiers,
242 :     the prefixes are built-in; however, for other identifiers, the following
243 :     prefixes are used.
244 :    
245 :     =over 4
246 :    
247 :     =item CMR
248 :    
249 :     C<cmr|> -- for example, C<cmr|10172815> and C<cmr|NTL01BH0204>.
250 :    
251 :     =item GENE (Common gene name)
252 :    
253 :     C<GENE:> -- for example, C<GENE:accD>
254 :    
255 :     =item GeneID (Common gene number)
256 :    
257 :     C<GeneID:> -- for example, C<GeneID:891745>
258 :    
259 :     =item KEGG
260 :    
261 :     C<kegg|> -- for example, C<kegg|sav:SAV2628>
262 :    
263 :     =item LocusTag
264 :    
265 :     C<LocusTag:> -- for example, C<LocusTag:ABK38410.1>
266 :    
267 :     =item NBCI
268 :    
269 :     C<gi|> -- for example, C<gi|10172815>, C<gi|15612766>, C<gi|81788207>.
270 :    
271 :     =item UniProt
272 :    
273 :     C<uni|> -- for example, C<uni|Q9KGA9>, C<uni|Q9KGA9_BACHD>
274 :    
275 :     =back
276 : parrello 1.1
277 : parrello 1.6 =head2 Two Normal-Mode Examples
278 : parrello 1.1
279 :     The following examples use the Sapling Server in normal mode: that is, data
280 :     is sent to the server in batches and the results are stitched together
281 :     afterward. In this mode there is significantly reduced overhead, but there is
282 :     also a risk that the server request might time out. If this happens, you may
283 :     want to consider breaking the input into smaller batches. At some point, the
284 :     server system will perform sophisticated flow control to reduce the risk of
285 : parrello 1.8 timeout errors, but we are not yet there.
286 : parrello 1.1
287 :     =head3 Retrieving Functional Roles
288 :    
289 :     There are two primary methods for retrieving functional roles.
290 :    
291 :     =over 4
292 :    
293 :     =item *
294 :    
295 :     L<SAP/ids_to_functions> returns the current functional assignment of a gene.
296 :    
297 :     =item *
298 :    
299 :     L<SAP/ids_to_subsystems> returns the subsystems and roles of a gene.
300 :    
301 :     =back
302 :    
303 :     In both cases, the list of incoming gene IDs is given as a list via the C<ids>
304 :     parameter. It is assumed the IDs are FIG identifiers, but the C<source> parameter
305 :     can be used to specify a different ID type (see L</Specifying Gene IDs>).
306 :    
307 :     B<ids_to_functions> provides the basic functional role. Almost every gene in the
308 :     system will return a result with this method. The following example program
309 :     reads a file of UniProt IDs and produces their functional roles. Note that
310 :     we're assuming the file is a manageable size: since we're submitting the entire
311 :     file at once, we risk a timeout error if it's too big.
312 :    
313 :     use strict;
314 :     use SAPserver;
315 :    
316 :     my $sapServer = SAPserver->new();
317 :     # Read all the gene IDs.
318 :     my @genes = map { chomp; $_ } <STDIN>;
319 :     # Compute the functional roles.
320 : parrello 1.7 my $results = $sapServer->ids_to_functions(-ids => \@genes, -source => 'UniProt');
321 : parrello 1.1 # Loop through the genes.
322 :     for my $gene (@genes) {
323 :     # Did we find a result?
324 :     my $role = $results->{$gene};
325 :     if (defined $role) {
326 :     # Yes, print it.
327 :     print "$gene\t$role\n";
328 :     } else {
329 :     # No, emit a warning.
330 :     print STDERR "$gene was not found.\n";
331 :     }
332 :     }
333 :    
334 : parrello 1.8 Sample output from this script is shown below. Note that one of the input IDs
335 :     was not found.
336 : parrello 1.4
337 :     HYPA_ECO57 [NiFe] hydrogenase nickel incorporation protein HypA
338 :     17KD_RICBR rickettsial 17 kDa surface antigen precursor
339 :     18KD_MYCLE 18 KDA antigen (HSP 16.7)
340 :     P72620_SYNY3 [NiFe] hydrogenase metallocenter assembly protein HypD
341 :     1A14_ARATH 1-aminocyclopropane-1-carboxylate synthase 4 / ACC synthase 4 (ACS4) / identical to gi:940370 [GB:U23481]; go_function: 1-aminocyclopropane-1-carboxylate synthase activity [goid 0016847]; go_process: ethylene biosynthesis [goid 0009693]; go_process: response to auxin stimulus [goid 0009733]
342 :     Q2RXN5_RHORT [NiFe] hydrogenase metallocenter assembly protein HypE
343 :     O29118 Glutamate N-acetyltransferase (EC 2.3.1.35) / N-acetylglutamate synthase (EC 2.3.1.1)
344 :     Q8PZM3 tRNA nucleotidyltransferase (EC 2.7.7.25)
345 :    
346 :     Q8YY27_ANASP was not found.
347 :    
348 : parrello 1.8 B<ids_to_subsystems> returns roles in subsystems. Roles in subsystems have
349 :     several differences from general functional roles. Only half of the genes in the
350 :     database are currently associated with subsystems.A single gene may be in In
351 :     addition, multiple subsystems and may have multiple roles in a subsystem.
352 :    
353 : parrello 1.1 As a result, instead of a single string per incoming gene, B<ids_to_subsystems>
354 :     returns a list. Each element of the list consists of the role name followed by
355 :     the subsystem name. This makes the processing of the results a little more
356 :     complex, because we have to iterate through the list. An empty list indicates
357 :     the gene is not in a subsystem (although it could also indicate the gene was
358 :     not found).
359 :    
360 :     use SAPserver;
361 :    
362 :     my $sapServer = SAPserver->new();
363 :     # Read all the gene IDs.
364 :     my @genes = map { chomp; $_ } <STDIN>;
365 :     # Compute the functional roles.
366 : parrello 1.7 my $results = $sapServer->ids_to_subsystems(-ids => \@genes, -source => 'UniProt');
367 : parrello 1.1 # Loop through the genes.
368 :     for my $gene (@genes) {
369 :     # Did we find a result?
370 :     my $roleData = $results->{$gene};
371 :     if (! @$roleData) {
372 :     # Not in a subsystem: emit a warning.
373 :     print STDERR "$gene is not in a subsystem.\n";
374 :     } else {
375 :     # Yes, print the entries.
376 :     for my $ssData (@$roleData) {
377 :     print "$gene\t$ssData->[0]\t($ssData->[1])\n"
378 :     }
379 :     }
380 :     }
381 :    
382 : parrello 1.4 Sample output from this script is shown below. In this case, four of the input IDs
383 :     failed to find a result; however, two of them (C<O29118> and C<Q8PZM3>) produced multiple
384 :     results.
385 :    
386 :     HYPA_ECO57 [NiFe] hydrogenase nickel incorporation protein HypA (NiFe hydrogenase maturation)
387 :     P72620_SYNY3 [NiFe] hydrogenase metallocenter assembly protein HypD (NiFe hydrogenase maturation)
388 :     Q2RXN5_RHORT [NiFe] hydrogenase metallocenter assembly protein HypE (NiFe hydrogenase maturation)
389 :     O29118 N-acetylglutamate synthase (EC 2.3.1.1) (Arginine Biosynthesis extended)
390 :     O29118 Glutamate N-acetyltransferase (EC 2.3.1.35) (Arginine Biosynthesis extended)
391 :     O29118 N-acetylglutamate synthase (EC 2.3.1.1) (Arginine Biosynthesis)
392 :     O29118 Glutamate N-acetyltransferase (EC 2.3.1.35) (Arginine Biosynthesis)
393 :     Q8PZM3 tRNA nucleotidyltransferase (EC 2.7.7.25) (CBSS-316057.3.peg.1294)
394 :     Q8PZM3 tRNA nucleotidyltransferase (EC 2.7.7.25) (tRNA nucleotidyltransferase)
395 :    
396 :     17KD_RICBR is not in a subsystem.
397 :     Q8YY27_ANASP is not in a subsystem.
398 :     18KD_MYCLE is not in a subsystem.
399 :     1A14_ARATH is not in a subsystem.
400 :    
401 : parrello 1.2 =head3 Genes in Subsystems for Genomes
402 : parrello 1.1
403 : parrello 1.2 This next example finds all genes in subsystems for a specific genome. We will
404 :     perform this operation in two phases. First, we will find the subsystems for
405 :     each genome, and then the genes for those subsystems. This requires two Sapling
406 :     Server functions.
407 : parrello 1.1
408 : parrello 1.2 =over 4
409 :    
410 :     =item *
411 :    
412 :     L<SAP/genomes_to_subsystems> returns a list of the subsystems for each
413 :     specified genome.
414 :    
415 :     =item *
416 :    
417 :     L<SAP/ids_in_subsystems> returns a list of the genes in each listed
418 :     subsystem for a specified genome.
419 :    
420 :     =back
421 :    
422 :     Our example program will loop through the genome IDs in an input file. For
423 :     each genome, it will call I<genomes_to_subsystems> to get the subsystem list,
424 :     and then feed the list to I<ids_in_subsystems> to get the result.
425 :    
426 :     L<SAP/ids_in_subsystems> returns gene IDs rather than taking them as input.
427 :     Like L<SAP/ids_to_subsystems> and L<SAP/ids_to_functions>, it takes a C<source>
428 :     parameter that indicates the type of ID desired (e.g. C<NCBI>, C<CMR>, C<LocusTag>).
429 :     In this case, however, the type describes how the gene IDs will be formatted in
430 :     the output rather than the type expected upon input. If a gene does not have an
431 :     ID for a particular source type (e.g. C<fig|100226.1.peg.3361> does not have a
432 :     locus tag), then the FIG identifier is used. The default source type (C<SEED>)
433 :     means that FIG identifiers will be used for everything.
434 :    
435 :     The program is given below.
436 :    
437 :     use strict;
438 :     use SAPserver;
439 :    
440 :     my $sapServer = SAPserver->new();
441 :     # Loop through the input file. Note that this loop will stop on the first
442 :     # blank line.
443 :     while (my $genomeID = <STDIN>) {
444 :     chomp $genomeID;
445 :     # Get the subsystems for this genome.
446 : parrello 1.7 my $subHash = $sapServer->genomes_to_subsystems(-ids => $genomeID);
447 : parrello 1.2 # The data returned for each genome (and in our case there's only one)
448 :     # includes the subsystem name and the variant code. The following
449 :     # statement strips away the variant codes, leaving only the subsystem
450 :     # names.
451 :     my $subList = [ map { $_->[0] } @{$subHash->{$genomeID}} ];
452 :     # Ask for the genes in each subsystem, using NCBI identifiers.
453 : parrello 1.7 my $roleHashes = $sapServer->ids_in_subsystems(-subsystems => $subList,
454 :     -genome => $genomeID,
455 :     -source => 'NCBI',
456 :     -roleForm => 'full');
457 : parrello 1.2 # The hash maps each subsystem ID to a hash of roles to lists of feature
458 :     # IDs. We therefore use three levels of nested loops to produce our
459 :     # output lines. At the top level we have a hash mapping subsystem IDs
460 :     # to role hashes.
461 :     for my $subsystem (sort keys %$roleHashes) {
462 :     my $geneHash = $roleHashes->{$subsystem};
463 :     # The gene hash maps each role to a list of gene IDs.
464 :     for my $role (sort keys %$geneHash) {
465 :     my $geneList = $geneHash->{$role};
466 :     # Finally, we loop through the gene IDs.
467 :     for my $gene (@$geneList) {
468 : parrello 1.4 print "$genomeID\t$gene\t$subsystem\t$role\n";
469 : parrello 1.2 }
470 :     }
471 :     }
472 :     }
473 : parrello 1.1
474 : parrello 1.4 An excerpt of the output is shown here.
475 :    
476 :     360108.3 85840651 Queuosine-Archaeosine Biosynthesis Queuosine Biosynthesis QueC ATPase
477 :     360108.3 85841812 Queuosine-Archaeosine Biosynthesis Queuosine Biosynthesis QueE Radical SAM
478 :     360108.3 85841520 Queuosine-Archaeosine Biosynthesis Queuosine biosynthesis QueD, PTPS-I
479 :     360108.3 85841162 Queuosine-Archaeosine Biosynthesis S-adenosylmethionine:tRNA ribosyltransferase-isomerase (EC 5.-.-.-)
480 :     360108.3 85842244 Queuosine-Archaeosine Biosynthesis tRNA-guanine transglycosylase (EC 2.4.2.29)
481 :     360108.3 85841653 Quinate degradation 3-dehydroquinate dehydratase II (EC 4.2.1.10)
482 :     360108.3 85840760 RNA polymerase bacterial DNA-directed RNA polymerase alpha subunit (EC 2.7.7.6)
483 :     360108.3 85841269 RNA polymerase bacterial DNA-directed RNA polymerase beta subunit (EC 2.7.7.6)
484 :     360108.3 85841348 RNA polymerase bacterial DNA-directed RNA polymerase beta' subunit (EC 2.7.7.6)
485 :     360108.3 85841887 RNA polymerase bacterial DNA-directed RNA polymerase omega subunit (EC 2.7.7.6)
486 :     360108.3 85841283 RNA processing and degradation, bacterial 3'-to-5' exoribonuclease RNase R
487 :     360108.3 85840820 RNA processing and degradation, bacterial Ribonuclease III (EC 3.1.26.3)
488 :     360108.3 85842272 Recycling of Peptidoglycan Amino Acids Aminoacyl-histidine dipeptidase (Peptidase D) (EC 3.4.13.3)
489 :    
490 : parrello 1.3 The I<ids_in_subsystems> service has several useful options for changing the nature
491 :     of the output. For example, in the above program each role is represented by a
492 :     full description (C<roleForm> set to C<full>). If you don't need the roles, you
493 :     can specify C<none> for the role form. You can also request that the gene IDs
494 :     be returned in a comma-separated list instead of a list data structure. These
495 :     two changes can drastically simplify the above program.
496 :    
497 :     use strict;
498 :     use SAPserver;
499 :    
500 :     my $sapServer = SAPserver->new();
501 :     # Loop through the input file. Note that this loop will stop on the first
502 :     # blank line.
503 :     while (my $genomeID = <STDIN>) {
504 :     chomp $genomeID;
505 :     # Get the subsystems for this genome.
506 : parrello 1.7 my $subHash = $sapServer->genomes_to_subsystems(-ids => $genomeID);
507 : parrello 1.3 # The data returned for each genome (and in our case there's only one)
508 :     # includes the subsystem name and the variant code. The following
509 :     # statement strips away the variant codes, leaving only the subsystem
510 :     # names.
511 :     my $subList = [ map { $_->[0] } @{$subHash->{$genomeID}} ];
512 :     # Ask for the genes in each subsystem, using NCBI identifiers.
513 : parrello 1.7 my $genesHash = $sapServer->ids_in_subsystems(-subsystems => $subList,
514 :     -genome => $genomeID,
515 :     -source => 'NCBI',
516 :     -roleForm => 'none',
517 :     -grouped => 1);
518 : parrello 1.3 # The hash maps each subsystem ID to a comma-delimited list of gene IDs.
519 :     for my $subsystem (sort keys %$genesHash) {
520 :     my $genes = $genesHash->{$subsystem};
521 :     print "$genomeID\t$subsystem\t$genes\n";
522 :     }
523 :     }
524 :    
525 : parrello 1.4 The sample output in this case looks quite different. The role information is missing,
526 :     and all the data for a subsystem is in a single line.
527 :    
528 :     360108.3 Queuosine-Archaeosine Biosynthesis 85841622, 85841791, 85840661, 85841162, 85841520, 85842244, 85840651, 85841812
529 :     360108.3 Quinate degradation 85841653
530 :     360108.3 RNA polymerase bacterial 85840760, 85841269, 85841348, 85841887
531 :     360108.3 RNA processing and degradation, bacterial 85840820, 85841283
532 :     360108.3 Recycling of Peptidoglycan Amino Acids 85842019, 85842272
533 :    
534 : parrello 1.6 =head2 A More Complicated Example: Operon Upstream Regions
535 :    
536 :     In this example we'll string several services together to perform a more
537 :     complex task: locating the upstream regions of operons involved in a particular
538 :     metabolic pathway. The theory is that we can look for a common pattern in
539 :     these regions.
540 :    
541 :     A metabolic pathway is a subsystem, so we'll enter our database via the
542 :     subsystems. To keep the data manageable, we'll limit our results to
543 :     specific genomes. The program we write will take as input a subsystem name
544 :     and a file of genome IDs.
545 :    
546 :     The worst part of the task is finding the operon for a gene. This involves
547 :     finding all the genes in a neighborhood and isolating the ones that point in
548 :     the correct direction. Fortunately, there is a Sapling Server function--
549 :     L<SAP/make_runs>-- that specifcally performs this task.
550 :    
551 :     To start our program, we create a L<SAPserver> object and pull the subsystem
552 :     name from the command-line parameters. This program is going to be doing a
553 :     lot of complicated, long-running stuff, so we'll usually want to deal with one
554 :     result at a time. To facilitate that, we construct the server helper in
555 :     singleton mode.
556 :    
557 :     use strict;
558 :     use SAPserver;
559 :    
560 :     my $sapServer = SAPserver->new(singleton => 1);
561 :     # Get the subsystem name.
562 :     my $ssName = $ARGV[0];
563 :    
564 :     Our main loop will read through the list of genomes from the standard input
565 :     and call a method I<PrintUpstream> to process each one. We're going to be a bit
566 :     cheesy here and allow our genome loop to stop on the first blank line.
567 :    
568 :     while (my $genomeID = <STDIN>) {
569 :     chomp $genomeID;
570 :     PrintUpstream($sapServer, $ssName, $genomeID);
571 :     }
572 :    
573 :     Now we need to write I<PrintUpstream>. Its first task is to find all the
574 :     genes in the genome that belong to the subsystem. A single call to
575 :     L<SAP/ids_in_subsystems> will get this information. We then feed the
576 :     results into L<SAP/make_runs> to get operons and call L<SAP/upstream> for
577 :     each operon. The program is given below.
578 :    
579 :     sub PrintUpstream {
580 :     my ($sapServer, $ssName, $genomeID) = @_;
581 :     # Because we specify "roleForm => none", we get back one long
582 :     # gene list.
583 : parrello 1.7 my $geneList = $sapServer->ids_in_subsystems(-subsystems => $ssName,
584 :     -genome => $genomeID,
585 :     -roleForm => 'none');
586 : parrello 1.6 # Convert the gene list to a comma-delimited string.
587 :     my $geneString = join(", ", @$geneList);
588 :     # Get a list of operons.
589 : parrello 1.7 my $opList = $sapServer->make_runs(-groups => $geneString);
590 : parrello 1.6 # Loop through the operons.
591 :     for my $opString (@$opList) {
592 :     # Get the first feature's ID.
593 :     my ($front) = split /\s*,/, $opString, 2;
594 :     # Grab its upstream region. We'll include the operon string as the
595 :     # comment text.
596 : parrello 1.7 my $fasta = $sapServer->upstream(-ids => $front,
597 :     -comments => { $front => $opString },
598 :     -skipGene => 1);
599 : parrello 1.6 # Print the result.
600 :     print "$fasta\n";
601 :     }
602 :     }
603 :    
604 :    
605 : parrello 1.1 =cut
606 :    
607 :     1;

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3