[Bio] / FigKernelPackages / SAPtutorial.pm Repository:
ViewVC logotype

Annotation of /FigKernelPackages/SAPtutorial.pm

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1.8 - (view) (download) (as text)

1 : parrello 1.1 #!/usr/bin/perl -w
2 :     use strict;
3 :    
4 :     =head1 Using the Sapling Server -- A Tutorial
5 :    
6 :     =head2 What Is the Sapling Server?
7 :    
8 :     The B<Sapling Server> is a web service that allows you to access data stored in
9 :     the Sapling database, a large-scale MySQL database of genetic data. The Sapling
10 : parrello 1.5 database contains data on public genomes imported from I<RAST> (L<http://rast.nmpdr.org>)
11 : parrello 1.1 and curated by the annotation team of the Fellowship for Interpretation of
12 :     Genomes.
13 :    
14 :     The L<SAPserver> package wraps calls to the Sapling Server so that you can use
15 :     them in your PERL program.
16 :    
17 :     All Sapling Server services are documented in the L<SAP> module. Each method has
18 :     a signature like
19 :    
20 :     my $document = $sapObject->taxonomy_of($args);
21 :    
22 :     where C<$document> is usually a hash reference and C<$args> is B<always> a hash
23 :     reference. The method description includes a section called I<Parameter Hash
24 :     Fields> that describes the fields in C<$args>. For example, L<SAP/taxonomy_of>
25 : parrello 1.8 has a field called C<-ids> that is to be a list of genome IDs and an optional
26 :     field called C<-format> that indicates whether you want taxonomy groups
27 : parrello 1.1 represented by numbers, names, or both. To call the I<taxonomy_of> service,
28 :     you create a B<SAPserver> object and call a method with the same name as the
29 :     service.
30 :    
31 : parrello 1.4 use strict;
32 : parrello 1.1 use SAPserver;
33 :    
34 :     my $sapServer = SAPserver->new();
35 : parrello 1.7 my $resultHash = $sapServer->taxonomy_of({ -ids => ['360108.3', '100226.1'],
36 :     -format => 'names' });
37 : parrello 1.1 for my $id (keys %$resultHash) {
38 :     my $taxonomy = $resultHash->{$id};
39 : parrello 1.4 print "$id: " . join(" ", @$taxonomy) . "\n";
40 : parrello 1.1 }
41 :    
42 : parrello 1.4 The output from this program (reformatted slightly for readability) is shown below.
43 :    
44 :     360108.3: Bacteria, Proteobacteria, delta/epsilon subdivisions, Epsilonproteobacteria,
45 :     Campylobacterales, Campylobacteraceae, Campylobacter, Campylobacter jejuni,
46 :     Campylobacter jejuni subsp. jejuni, Campylobacter jejuni subsp. jejuni 260.94
47 :    
48 :     100226.1: Bacteria, Actinobacteria, Actinobacteria (class), Actinobacteridae,
49 :     Actinomycetales, Streptomycineae, Streptomycetaceae, Streptomyces,
50 :     Streptomyces coelicolor, Streptomyces coelicolor A3(2)
51 :    
52 : parrello 1.1 For convenience, you can specify the parameters as a simple hash rather
53 :     than a hash reference. So, for example, the above I<taxonomy_of> call could
54 :     also be written like this.
55 :    
56 : parrello 1.7 my $resultHash = $sapServer->taxonomy_of(-ids => ['360108.3', '100226.1'],
57 :     -format => 'names');
58 : parrello 1.1
59 :     =head2 A Simple Example: Genome Taxonomies
60 :    
61 :     Let's look at a simple program that pulls all the complete genomes from the
62 :     database and displays their representative genomes plus their texonomies in name
63 :     format.
64 :    
65 :     Three Sapling Server methods are needed to perform this function.
66 :    
67 :     =over 4
68 :    
69 :     =item *
70 :    
71 :     L<SAP/all_genomes> to get the list of genome IDs.
72 :    
73 :     =item *
74 :    
75 :     L<SAP/taxonomy_of> to get the genome taxonomies.
76 :    
77 :     =item *
78 :    
79 :     L<SAP/representative> to get the representative genome IDs.
80 :    
81 :     =back
82 :    
83 :     The program starts by connecting to the Sapling Server itself.
84 :    
85 :     use strict;
86 :     use SAPserver;
87 :    
88 :     my $sapServer = SAPserver->new();
89 :    
90 : parrello 1.8 Now we use I<all_genomes> to get a list of the complete genomes.
91 :     I<all_genomes> will normally return B<all> genomes, but we use the
92 :     C<-complete> option to restrict the output to those that are complete.
93 : parrello 1.1
94 : parrello 1.8 my $genomeIDs = $sapServer->all_genomes(-complete => 1);
95 : parrello 1.1
96 :     All we want are the genome IDs, so we use a PERL trick to convert the
97 : parrello 1.8 hash reference in C<$genomeIDs> to a list reference.
98 : parrello 1.1
99 :     $genomeIDs = [ keys %$genomeIDs ];
100 :    
101 :     Now we ask for the representatives and the taxonomies.
102 :    
103 : parrello 1.7 my $representativeHash = $sapServer->representative(-ids => $genomeIDs);
104 :     my $taxonomyHash = $sapServer->taxonomy_of(-ids => $genomeIDs,
105 :     -format => 'names');
106 : parrello 1.1
107 :     Our data is now contained in a pair of hash tables. The following loop
108 :     stiches them together to produce the output.
109 :    
110 :     for my $genomeID (@$genomeIDs) {
111 :     my $repID = $representativeHash->{$genomeID};
112 :     my $taxonomy = $taxonomyHash->{$genomeID};
113 :     # Format the taxonomy string.
114 :     my $taxonomyString = join(" ", @$taxonomy);
115 :     # Print the result.
116 :     print join("\t", $genomeID, $repID, $taxonomyString) . "\n";
117 :     }
118 :    
119 : parrello 1.4 An excerpt from the output of this script is shown below. The first column contains
120 :     a genome ID, the second contains the representative genome's ID, and the third is
121 :     the full taxonomy. Note that the two genomes with very close taxonomies have the
122 : parrello 1.8 same representative genome: this is the expected behavior.
123 : parrello 1.4
124 :     221109.1 221109.1 Bacteria Firmicutes Bacilli Bacillales Bacillaceae Oceanobacillus Oceanobacillus iheyensis Oceanobacillus iheyensis HTE831
125 :     204722.1 204722.1 Bacteria Proteobacteria Alphaproteobacteria Rhizobiales Brucellaceae Brucella Brucella suis Brucella suis 1330
126 :     391037.3 369723.3 Bacteria Actinobacteria Actinobacteria (class) Actinobacteridae Actinomycetales Micromonosporineae Micromonosporaceae Salinispora Salinispora arenicola Salinispora arenicola CNS205
127 :     339670.3 216591.1 Bacteria Proteobacteria Betaproteobacteria Burkholderiales Burkholderiaceae Burkholderia Burkholderia cepacia complex Burkholderia cepacia Burkholderia cepacia AMMD
128 :     272560.3 216591.1 Bacteria Proteobacteria Betaproteobacteria Burkholderiales Burkholderiaceae Burkholderia pseudomallei group Burkholderia pseudomallei Burkholderia pseudomallei K96243
129 :     262768.1 262768.1 Bacteria Firmicutes Mollicutes Acholeplasmatales Acholeplasmataceae Candidatus Phytoplasma Candidatus Phytoplasma asteris Onion yellows phytoplasma Onion yellows phytoplasma OY-M
130 :     272624.3 272624.3 Bacteria Proteobacteria Gammaproteobacteria Legionellales Legionellaceae Legionella Legionella pneumophila Legionella pneumophila subsp. pneumophila Legionella pneumophila subsp. pneumophila str. Philadelphia 1
131 :     150340.3 150340.3 Bacteria Proteobacteria Gammaproteobacteria Vibrionales Vibrionaceae Vibrio Vibrio sp. Ex25
132 :     205914.1 205914.1 Bacteria Proteobacteria Gammaproteobacteria Pasteurellales Pasteurellaceae Histophilus Histophilus somni Haemophilus somnus 129PT
133 :     393117.3 169963.1 Bacteria Firmicutes Bacilli Bacillales Listeriaceae Listeria Listeria monocytogenes Listeria monocytogenes FSL J1-194
134 :     203119.1 203119.1 Bacteria Firmicutes Clostridia Clostridiales Clostridiaceae Clostridium Clostridium thermocellum Clostridium thermocellum ATCC 27405
135 :     380703.5 380703.5 Bacteria Proteobacteria Gammaproteobacteria Aeromonadales Aeromonadaceae Aeromonas Aeromonas hydrophila Aeromonas hydrophila subsp. hydrophila Aeromonas hydrophila subsp. hydrophila ATCC 7966
136 :     259536.4 259536.4 Bacteria Proteobacteria Gammaproteobacteria Pseudomonadales Moraxellaceae Psychrobacter Psychrobacter arcticus Psychrobacter arcticus 273-4
137 :    
138 : parrello 1.1 The Sapling Server processes lists of data (in this case a list of genome IDs)
139 :     so that you can minimize the overhead of sending requests across the web. You
140 :     can, however, specify a single ID instead of a list to a method call, and
141 :     this would allow you to structure your program with a more traditional loop, as
142 :     follows. To make this process simpler, you construct the Sapling Server object
143 :     in I<singleton mode>. In singleton mode, when you pass in only a single ID,
144 :     you get back an actual result instead of a hash reference.
145 :    
146 :     use strict;
147 :     use SAPserver;
148 :    
149 :     my $sapServer = SAPserver->new(singleton => 1);
150 : parrello 1.7 my $genomeIDs = $sapServer->all_genomes(-complete => 1);
151 : parrello 1.1 $genomeIDs = [ keys %$genomeIDs ];
152 :     for my $genomeID (@$genomeIDs) {
153 : parrello 1.7 my $repID = $sapServer->representative(-ids => $genomeID);
154 :     my $taxonomy = $sapServer->taxonomy_of(-ids => $genomeID,
155 :     -format => 'names');
156 : parrello 1.1 # Format the taxonomy string.
157 :     my $taxonomyString = join(" ", @$taxonomy);
158 :     # Print the result.
159 :     print join("\t", $genomeID, $repID, $taxonomyString) . "\n";
160 :     }
161 :    
162 :     At this point there is a risk that you are bewildered by all the options we've
163 :     presented-- hashes, hash references, singleton mode-- however, the goal here is
164 :     to provide a facility that will fit comfortably with different programming
165 :     styles. The server software tries to figure out how you want to use it and
166 :     adjusts accordingly.
167 :    
168 :     =head2 Specifying Gene IDs
169 :    
170 :     Many of the Sapling Server services return data on genes (a term we use rather
171 : parrello 1.8 loosely to include any kind of genetic I<locus> or I<feature>). The standard
172 : parrello 1.1 method for identifying a gene is the I<FIG ID>, an identifying string that
173 :     begins with the characters C<fig|> and includes the genome ID, the gene type,
174 :     and an additional number for uniqueness. For example, the FIG ID
175 :     C<fig|272558.1.peg.203> describes a protein encoding gene (I<peg>) in
176 :     Bacillus halodurans C-125 (I<272558.1>).
177 :    
178 :     Frequently, however, you will have a list of gene IDs from some other
179 : parrello 1.8 database (e.g. I<NCBI>, I<UniProt>) or in a community format such as Locus Tags
180 :     or gene names. Most services that take gene IDs as input allow you to specify a
181 :     C<-source> option that indicates the type of IDs being used. The acceptable
182 :     formats are as follows.
183 : parrello 1.1
184 :     =over 4
185 :    
186 :     =item CMR
187 :    
188 : parrello 1.5 I<Comprehensive Microbial Resource> (L<http://cmr.jcvi.org>) ID. The CMR IDs for
189 : parrello 1.1 C<fig|272558.1.peg.203> are C<10172815> and C<NTL01BH0204>.
190 :    
191 :     =item GENE
192 :    
193 :     Common Gene name. Often, these correspond to a large number of specified genes.
194 :     For example, C<accD>, which identifies the Acetyl-coenzyme A carboxyl
195 :     transferase beta chain, corresponds to 58 specific genes in the database.
196 :    
197 :     =item GeneID
198 :    
199 :     Common gene number. The common gene number for C<fig|272558.1.peg.203> is
200 :     C<891745>.
201 :    
202 :     =item KEGG
203 :    
204 : parrello 1.5 I<Kyoto Encyclopedia of Genes and Genomes> (L<http://www.genome.jp/kegg>) identifier.
205 : parrello 1.1 For example, the KEGG identifier for C<fig|158878.1.peg.2821> is C<sav:SAV2628>.
206 :    
207 :     =item LocusTag
208 :    
209 :     Common locus tag. For example, the locus tag of C<fig|380703.5.peg.250> is
210 :     C<ABK38410.1>.
211 :    
212 :     =item NCBI
213 :    
214 : parrello 1.5 I<NCBI> (L<http://www.ncbi.nlm.nih.gov>) accession number. The accession numbers for
215 : parrello 1.1 C<fig|272558.1.peg.203> are C<10172815>, C<15612766>, and C<81788207>.
216 :    
217 :     =item RefSeq
218 :    
219 : parrello 1.5 I<NCBI> (L<http://www.ncbi.nlm.nih.gov>) reference sequence identifier. The RefSeq
220 : parrello 1.1 identifier for C<fig|272558.1.peg.203> is C<NP_241069.1>.
221 :    
222 :     =item SEED
223 :    
224 :     FIG identifier. This is the default option.
225 :    
226 :     =item SwissProt
227 :    
228 : parrello 1.5 I<SwissProt> (L<http://www.expasy.ch/sprot>) identifier. For example, the SwissProt
229 : parrello 1.1 identifier for C<fig|243277.1.peg.3952> is C<O31153>.
230 :    
231 :     =item UniProt
232 :    
233 : parrello 1.5 I<UniProt> (L<http://www.uniprot.org>) identifier. The UniProt identifiers for
234 : parrello 1.1 C<fig|272558.1.peg.203> are C<Q9KGA9> and C<Q9KGA9_BACHD>.
235 :    
236 :     =back
237 :    
238 :     You can also mix identifiers of different types by specifying C<mixed>
239 :     for the source type. In this case, however, care must be taken, because the same
240 :     identifier can have different meanings in different databases.
241 :    
242 : parrello 1.6 =head2 Two Normal-Mode Examples
243 : parrello 1.1
244 :     The following examples use the Sapling Server in normal mode: that is, data
245 :     is sent to the server in batches and the results are stitched together
246 :     afterward. In this mode there is significantly reduced overhead, but there is
247 :     also a risk that the server request might time out. If this happens, you may
248 :     want to consider breaking the input into smaller batches. At some point, the
249 :     server system will perform sophisticated flow control to reduce the risk of
250 : parrello 1.8 timeout errors, but we are not yet there.
251 : parrello 1.1
252 :     =head3 Retrieving Functional Roles
253 :    
254 :     There are two primary methods for retrieving functional roles.
255 :    
256 :     =over 4
257 :    
258 :     =item *
259 :    
260 :     L<SAP/ids_to_functions> returns the current functional assignment of a gene.
261 :    
262 :     =item *
263 :    
264 :     L<SAP/ids_to_subsystems> returns the subsystems and roles of a gene.
265 :    
266 :     =back
267 :    
268 :     In both cases, the list of incoming gene IDs is given as a list via the C<ids>
269 :     parameter. It is assumed the IDs are FIG identifiers, but the C<source> parameter
270 :     can be used to specify a different ID type (see L</Specifying Gene IDs>).
271 :    
272 :     B<ids_to_functions> provides the basic functional role. Almost every gene in the
273 :     system will return a result with this method. The following example program
274 :     reads a file of UniProt IDs and produces their functional roles. Note that
275 :     we're assuming the file is a manageable size: since we're submitting the entire
276 :     file at once, we risk a timeout error if it's too big.
277 :    
278 :     use strict;
279 :     use SAPserver;
280 :    
281 :     my $sapServer = SAPserver->new();
282 :     # Read all the gene IDs.
283 :     my @genes = map { chomp; $_ } <STDIN>;
284 :     # Compute the functional roles.
285 : parrello 1.7 my $results = $sapServer->ids_to_functions(-ids => \@genes, -source => 'UniProt');
286 : parrello 1.1 # Loop through the genes.
287 :     for my $gene (@genes) {
288 :     # Did we find a result?
289 :     my $role = $results->{$gene};
290 :     if (defined $role) {
291 :     # Yes, print it.
292 :     print "$gene\t$role\n";
293 :     } else {
294 :     # No, emit a warning.
295 :     print STDERR "$gene was not found.\n";
296 :     }
297 :     }
298 :    
299 : parrello 1.8 Sample output from this script is shown below. Note that one of the input IDs
300 :     was not found.
301 : parrello 1.4
302 :     HYPA_ECO57 [NiFe] hydrogenase nickel incorporation protein HypA
303 :     17KD_RICBR rickettsial 17 kDa surface antigen precursor
304 :     18KD_MYCLE 18 KDA antigen (HSP 16.7)
305 :     P72620_SYNY3 [NiFe] hydrogenase metallocenter assembly protein HypD
306 :     1A14_ARATH 1-aminocyclopropane-1-carboxylate synthase 4 / ACC synthase 4 (ACS4) / identical to gi:940370 [GB:U23481]; go_function: 1-aminocyclopropane-1-carboxylate synthase activity [goid 0016847]; go_process: ethylene biosynthesis [goid 0009693]; go_process: response to auxin stimulus [goid 0009733]
307 :     Q2RXN5_RHORT [NiFe] hydrogenase metallocenter assembly protein HypE
308 :     O29118 Glutamate N-acetyltransferase (EC 2.3.1.35) / N-acetylglutamate synthase (EC 2.3.1.1)
309 :     Q8PZM3 tRNA nucleotidyltransferase (EC 2.7.7.25)
310 :    
311 :     Q8YY27_ANASP was not found.
312 :    
313 : parrello 1.8 B<ids_to_subsystems> returns roles in subsystems. Roles in subsystems have
314 :     several differences from general functional roles. Only half of the genes in the
315 :     database are currently associated with subsystems.A single gene may be in In
316 :     addition, multiple subsystems and may have multiple roles in a subsystem.
317 :    
318 : parrello 1.1 As a result, instead of a single string per incoming gene, B<ids_to_subsystems>
319 :     returns a list. Each element of the list consists of the role name followed by
320 :     the subsystem name. This makes the processing of the results a little more
321 :     complex, because we have to iterate through the list. An empty list indicates
322 :     the gene is not in a subsystem (although it could also indicate the gene was
323 :     not found).
324 :    
325 :     use SAPserver;
326 :    
327 :     my $sapServer = SAPserver->new();
328 :     # Read all the gene IDs.
329 :     my @genes = map { chomp; $_ } <STDIN>;
330 :     # Compute the functional roles.
331 : parrello 1.7 my $results = $sapServer->ids_to_subsystems(-ids => \@genes, -source => 'UniProt');
332 : parrello 1.1 # Loop through the genes.
333 :     for my $gene (@genes) {
334 :     # Did we find a result?
335 :     my $roleData = $results->{$gene};
336 :     if (! @$roleData) {
337 :     # Not in a subsystem: emit a warning.
338 :     print STDERR "$gene is not in a subsystem.\n";
339 :     } else {
340 :     # Yes, print the entries.
341 :     for my $ssData (@$roleData) {
342 :     print "$gene\t$ssData->[0]\t($ssData->[1])\n"
343 :     }
344 :     }
345 :     }
346 :    
347 : parrello 1.4 Sample output from this script is shown below. In this case, four of the input IDs
348 :     failed to find a result; however, two of them (C<O29118> and C<Q8PZM3>) produced multiple
349 :     results.
350 :    
351 :     HYPA_ECO57 [NiFe] hydrogenase nickel incorporation protein HypA (NiFe hydrogenase maturation)
352 :     P72620_SYNY3 [NiFe] hydrogenase metallocenter assembly protein HypD (NiFe hydrogenase maturation)
353 :     Q2RXN5_RHORT [NiFe] hydrogenase metallocenter assembly protein HypE (NiFe hydrogenase maturation)
354 :     O29118 N-acetylglutamate synthase (EC 2.3.1.1) (Arginine Biosynthesis extended)
355 :     O29118 Glutamate N-acetyltransferase (EC 2.3.1.35) (Arginine Biosynthesis extended)
356 :     O29118 N-acetylglutamate synthase (EC 2.3.1.1) (Arginine Biosynthesis)
357 :     O29118 Glutamate N-acetyltransferase (EC 2.3.1.35) (Arginine Biosynthesis)
358 :     Q8PZM3 tRNA nucleotidyltransferase (EC 2.7.7.25) (CBSS-316057.3.peg.1294)
359 :     Q8PZM3 tRNA nucleotidyltransferase (EC 2.7.7.25) (tRNA nucleotidyltransferase)
360 :    
361 :     17KD_RICBR is not in a subsystem.
362 :     Q8YY27_ANASP is not in a subsystem.
363 :     18KD_MYCLE is not in a subsystem.
364 :     1A14_ARATH is not in a subsystem.
365 :    
366 : parrello 1.2 =head3 Genes in Subsystems for Genomes
367 : parrello 1.1
368 : parrello 1.2 This next example finds all genes in subsystems for a specific genome. We will
369 :     perform this operation in two phases. First, we will find the subsystems for
370 :     each genome, and then the genes for those subsystems. This requires two Sapling
371 :     Server functions.
372 : parrello 1.1
373 : parrello 1.2 =over 4
374 :    
375 :     =item *
376 :    
377 :     L<SAP/genomes_to_subsystems> returns a list of the subsystems for each
378 :     specified genome.
379 :    
380 :     =item *
381 :    
382 :     L<SAP/ids_in_subsystems> returns a list of the genes in each listed
383 :     subsystem for a specified genome.
384 :    
385 :     =back
386 :    
387 :     Our example program will loop through the genome IDs in an input file. For
388 :     each genome, it will call I<genomes_to_subsystems> to get the subsystem list,
389 :     and then feed the list to I<ids_in_subsystems> to get the result.
390 :    
391 :     L<SAP/ids_in_subsystems> returns gene IDs rather than taking them as input.
392 :     Like L<SAP/ids_to_subsystems> and L<SAP/ids_to_functions>, it takes a C<source>
393 :     parameter that indicates the type of ID desired (e.g. C<NCBI>, C<CMR>, C<LocusTag>).
394 :     In this case, however, the type describes how the gene IDs will be formatted in
395 :     the output rather than the type expected upon input. If a gene does not have an
396 :     ID for a particular source type (e.g. C<fig|100226.1.peg.3361> does not have a
397 :     locus tag), then the FIG identifier is used. The default source type (C<SEED>)
398 :     means that FIG identifiers will be used for everything.
399 :    
400 :     The program is given below.
401 :    
402 :     use strict;
403 :     use SAPserver;
404 :    
405 :     my $sapServer = SAPserver->new();
406 :     # Loop through the input file. Note that this loop will stop on the first
407 :     # blank line.
408 :     while (my $genomeID = <STDIN>) {
409 :     chomp $genomeID;
410 :     # Get the subsystems for this genome.
411 : parrello 1.7 my $subHash = $sapServer->genomes_to_subsystems(-ids => $genomeID);
412 : parrello 1.2 # The data returned for each genome (and in our case there's only one)
413 :     # includes the subsystem name and the variant code. The following
414 :     # statement strips away the variant codes, leaving only the subsystem
415 :     # names.
416 :     my $subList = [ map { $_->[0] } @{$subHash->{$genomeID}} ];
417 :     # Ask for the genes in each subsystem, using NCBI identifiers.
418 : parrello 1.7 my $roleHashes = $sapServer->ids_in_subsystems(-subsystems => $subList,
419 :     -genome => $genomeID,
420 :     -source => 'NCBI',
421 :     -roleForm => 'full');
422 : parrello 1.2 # The hash maps each subsystem ID to a hash of roles to lists of feature
423 :     # IDs. We therefore use three levels of nested loops to produce our
424 :     # output lines. At the top level we have a hash mapping subsystem IDs
425 :     # to role hashes.
426 :     for my $subsystem (sort keys %$roleHashes) {
427 :     my $geneHash = $roleHashes->{$subsystem};
428 :     # The gene hash maps each role to a list of gene IDs.
429 :     for my $role (sort keys %$geneHash) {
430 :     my $geneList = $geneHash->{$role};
431 :     # Finally, we loop through the gene IDs.
432 :     for my $gene (@$geneList) {
433 : parrello 1.4 print "$genomeID\t$gene\t$subsystem\t$role\n";
434 : parrello 1.2 }
435 :     }
436 :     }
437 :     }
438 : parrello 1.1
439 : parrello 1.4 An excerpt of the output is shown here.
440 :    
441 :     360108.3 85840651 Queuosine-Archaeosine Biosynthesis Queuosine Biosynthesis QueC ATPase
442 :     360108.3 85841812 Queuosine-Archaeosine Biosynthesis Queuosine Biosynthesis QueE Radical SAM
443 :     360108.3 85841520 Queuosine-Archaeosine Biosynthesis Queuosine biosynthesis QueD, PTPS-I
444 :     360108.3 85841162 Queuosine-Archaeosine Biosynthesis S-adenosylmethionine:tRNA ribosyltransferase-isomerase (EC 5.-.-.-)
445 :     360108.3 85842244 Queuosine-Archaeosine Biosynthesis tRNA-guanine transglycosylase (EC 2.4.2.29)
446 :     360108.3 85841653 Quinate degradation 3-dehydroquinate dehydratase II (EC 4.2.1.10)
447 :     360108.3 85840760 RNA polymerase bacterial DNA-directed RNA polymerase alpha subunit (EC 2.7.7.6)
448 :     360108.3 85841269 RNA polymerase bacterial DNA-directed RNA polymerase beta subunit (EC 2.7.7.6)
449 :     360108.3 85841348 RNA polymerase bacterial DNA-directed RNA polymerase beta' subunit (EC 2.7.7.6)
450 :     360108.3 85841887 RNA polymerase bacterial DNA-directed RNA polymerase omega subunit (EC 2.7.7.6)
451 :     360108.3 85841283 RNA processing and degradation, bacterial 3'-to-5' exoribonuclease RNase R
452 :     360108.3 85840820 RNA processing and degradation, bacterial Ribonuclease III (EC 3.1.26.3)
453 :     360108.3 85842272 Recycling of Peptidoglycan Amino Acids Aminoacyl-histidine dipeptidase (Peptidase D) (EC 3.4.13.3)
454 :    
455 : parrello 1.3 The I<ids_in_subsystems> service has several useful options for changing the nature
456 :     of the output. For example, in the above program each role is represented by a
457 :     full description (C<roleForm> set to C<full>). If you don't need the roles, you
458 :     can specify C<none> for the role form. You can also request that the gene IDs
459 :     be returned in a comma-separated list instead of a list data structure. These
460 :     two changes can drastically simplify the above program.
461 :    
462 :     use strict;
463 :     use SAPserver;
464 :    
465 :     my $sapServer = SAPserver->new();
466 :     # Loop through the input file. Note that this loop will stop on the first
467 :     # blank line.
468 :     while (my $genomeID = <STDIN>) {
469 :     chomp $genomeID;
470 :     # Get the subsystems for this genome.
471 : parrello 1.7 my $subHash = $sapServer->genomes_to_subsystems(-ids => $genomeID);
472 : parrello 1.3 # The data returned for each genome (and in our case there's only one)
473 :     # includes the subsystem name and the variant code. The following
474 :     # statement strips away the variant codes, leaving only the subsystem
475 :     # names.
476 :     my $subList = [ map { $_->[0] } @{$subHash->{$genomeID}} ];
477 :     # Ask for the genes in each subsystem, using NCBI identifiers.
478 : parrello 1.7 my $genesHash = $sapServer->ids_in_subsystems(-subsystems => $subList,
479 :     -genome => $genomeID,
480 :     -source => 'NCBI',
481 :     -roleForm => 'none',
482 :     -grouped => 1);
483 : parrello 1.3 # The hash maps each subsystem ID to a comma-delimited list of gene IDs.
484 :     for my $subsystem (sort keys %$genesHash) {
485 :     my $genes = $genesHash->{$subsystem};
486 :     print "$genomeID\t$subsystem\t$genes\n";
487 :     }
488 :     }
489 :    
490 : parrello 1.4 The sample output in this case looks quite different. The role information is missing,
491 :     and all the data for a subsystem is in a single line.
492 :    
493 :     360108.3 Queuosine-Archaeosine Biosynthesis 85841622, 85841791, 85840661, 85841162, 85841520, 85842244, 85840651, 85841812
494 :     360108.3 Quinate degradation 85841653
495 :     360108.3 RNA polymerase bacterial 85840760, 85841269, 85841348, 85841887
496 :     360108.3 RNA processing and degradation, bacterial 85840820, 85841283
497 :     360108.3 Recycling of Peptidoglycan Amino Acids 85842019, 85842272
498 :    
499 : parrello 1.6 =head2 A More Complicated Example: Operon Upstream Regions
500 :    
501 :     In this example we'll string several services together to perform a more
502 :     complex task: locating the upstream regions of operons involved in a particular
503 :     metabolic pathway. The theory is that we can look for a common pattern in
504 :     these regions.
505 :    
506 :     A metabolic pathway is a subsystem, so we'll enter our database via the
507 :     subsystems. To keep the data manageable, we'll limit our results to
508 :     specific genomes. The program we write will take as input a subsystem name
509 :     and a file of genome IDs.
510 :    
511 :     The worst part of the task is finding the operon for a gene. This involves
512 :     finding all the genes in a neighborhood and isolating the ones that point in
513 :     the correct direction. Fortunately, there is a Sapling Server function--
514 :     L<SAP/make_runs>-- that specifcally performs this task.
515 :    
516 :     To start our program, we create a L<SAPserver> object and pull the subsystem
517 :     name from the command-line parameters. This program is going to be doing a
518 :     lot of complicated, long-running stuff, so we'll usually want to deal with one
519 :     result at a time. To facilitate that, we construct the server helper in
520 :     singleton mode.
521 :    
522 :     use strict;
523 :     use SAPserver;
524 :    
525 :     my $sapServer = SAPserver->new(singleton => 1);
526 :     # Get the subsystem name.
527 :     my $ssName = $ARGV[0];
528 :    
529 :     Our main loop will read through the list of genomes from the standard input
530 :     and call a method I<PrintUpstream> to process each one. We're going to be a bit
531 :     cheesy here and allow our genome loop to stop on the first blank line.
532 :    
533 :     while (my $genomeID = <STDIN>) {
534 :     chomp $genomeID;
535 :     PrintUpstream($sapServer, $ssName, $genomeID);
536 :     }
537 :    
538 :     Now we need to write I<PrintUpstream>. Its first task is to find all the
539 :     genes in the genome that belong to the subsystem. A single call to
540 :     L<SAP/ids_in_subsystems> will get this information. We then feed the
541 :     results into L<SAP/make_runs> to get operons and call L<SAP/upstream> for
542 :     each operon. The program is given below.
543 :    
544 :     sub PrintUpstream {
545 :     my ($sapServer, $ssName, $genomeID) = @_;
546 :     # Because we specify "roleForm => none", we get back one long
547 :     # gene list.
548 : parrello 1.7 my $geneList = $sapServer->ids_in_subsystems(-subsystems => $ssName,
549 :     -genome => $genomeID,
550 :     -roleForm => 'none');
551 : parrello 1.6 # Convert the gene list to a comma-delimited string.
552 :     my $geneString = join(", ", @$geneList);
553 :     # Get a list of operons.
554 : parrello 1.7 my $opList = $sapServer->make_runs(-groups => $geneString);
555 : parrello 1.6 # Loop through the operons.
556 :     for my $opString (@$opList) {
557 :     # Get the first feature's ID.
558 :     my ($front) = split /\s*,/, $opString, 2;
559 :     # Grab its upstream region. We'll include the operon string as the
560 :     # comment text.
561 : parrello 1.7 my $fasta = $sapServer->upstream(-ids => $front,
562 :     -comments => { $front => $opString },
563 :     -skipGene => 1);
564 : parrello 1.6 # Print the result.
565 :     print "$fasta\n";
566 :     }
567 :     }
568 :    
569 :    
570 : parrello 1.1 =cut
571 :    
572 :     1;

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3