[Bio] / FigKernelPackages / SAPtutorial.pm Repository:
ViewVC logotype

Annotation of /FigKernelPackages/SAPtutorial.pm

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1.5 - (view) (download) (as text)

1 : parrello 1.1 #!/usr/bin/perl -w
2 :     use strict;
3 :    
4 :     =head1 Using the Sapling Server -- A Tutorial
5 :    
6 :     =head2 What Is the Sapling Server?
7 :    
8 :     The B<Sapling Server> is a web service that allows you to access data stored in
9 :     the Sapling database, a large-scale MySQL database of genetic data. The Sapling
10 : parrello 1.5 database contains data on public genomes imported from I<RAST> (L<http://rast.nmpdr.org>)
11 : parrello 1.1 and curated by the annotation team of the Fellowship for Interpretation of
12 :     Genomes.
13 :    
14 :     The L<SAPserver> package wraps calls to the Sapling Server so that you can use
15 :     them in your PERL program.
16 :    
17 :     All Sapling Server services are documented in the L<SAP> module. Each method has
18 :     a signature like
19 :    
20 :     my $document = $sapObject->taxonomy_of($args);
21 :    
22 :     where C<$document> is usually a hash reference and C<$args> is B<always> a hash
23 :     reference. The method description includes a section called I<Parameter Hash
24 :     Fields> that describes the fields in C<$args>. For example, L<SAP/taxonomy_of>
25 :     has a field called C<ids> that is to be a list of genome IDs and an optional
26 :     field called C<format> that indicates whether you want taxonomy groups
27 :     represented by numbers, names, or both. To call the I<taxonomy_of> service,
28 :     you create a B<SAPserver> object and call a method with the same name as the
29 :     service.
30 :    
31 : parrello 1.4 use strict;
32 : parrello 1.1 use SAPserver;
33 :    
34 :     my $sapServer = SAPserver->new();
35 :     my $resultHash = $sapServer->taxonomy_of({ ids => ['360108.3', '100226.1'],
36 :     format => 'names' });
37 :     for my $id (keys %$resultHash) {
38 :     my $taxonomy = $resultHash->{$id};
39 : parrello 1.4 print "$id: " . join(" ", @$taxonomy) . "\n";
40 : parrello 1.1 }
41 :    
42 : parrello 1.4 The output from this program (reformatted slightly for readability) is shown below.
43 :    
44 :     360108.3: Bacteria, Proteobacteria, delta/epsilon subdivisions, Epsilonproteobacteria,
45 :     Campylobacterales, Campylobacteraceae, Campylobacter, Campylobacter jejuni,
46 :     Campylobacter jejuni subsp. jejuni, Campylobacter jejuni subsp. jejuni 260.94
47 :    
48 :     100226.1: Bacteria, Actinobacteria, Actinobacteria (class), Actinobacteridae,
49 :     Actinomycetales, Streptomycineae, Streptomycetaceae, Streptomyces,
50 :     Streptomyces coelicolor, Streptomyces coelicolor A3(2)
51 :    
52 : parrello 1.1 For convenience, you can specify the parameters as a simple hash rather
53 :     than a hash reference. So, for example, the above I<taxonomy_of> call could
54 :     also be written like this.
55 :    
56 :     my $resultHash = $sapServer->taxonomy_of(ids => ['360108.3', '100226.1'],
57 :     format => 'names');
58 :    
59 :     =head2 A Simple Example: Genome Taxonomies
60 :    
61 :     Let's look at a simple program that pulls all the complete genomes from the
62 :     database and displays their representative genomes plus their texonomies in name
63 :     format.
64 :    
65 :     Three Sapling Server methods are needed to perform this function.
66 :    
67 :     =over 4
68 :    
69 :     =item *
70 :    
71 :     L<SAP/all_genomes> to get the list of genome IDs.
72 :    
73 :     =item *
74 :    
75 :     L<SAP/taxonomy_of> to get the genome taxonomies.
76 :    
77 :     =item *
78 :    
79 :     L<SAP/representative> to get the representative genome IDs.
80 :    
81 :     =back
82 :    
83 :     The program starts by connecting to the Sapling Server itself.
84 :    
85 :     use strict;
86 :     use SAPserver;
87 :    
88 :     my $sapServer = SAPserver->new();
89 :    
90 :     Now we use I<all_genomes> to get a list of the IDs for complete genomes.
91 :     I<all_genomes> will normally return B<all> genome IDs, but we use the
92 :     C<complete> option to restrict the output to complete genomes.
93 :    
94 :     my $genomeIDs = $sapServer->all_genomes(complete => 1);
95 :    
96 :     All we want are the genome IDs, so we use a PERL trick to convert the
97 :     hash reference to a list reference.
98 :    
99 :     $genomeIDs = [ keys %$genomeIDs ];
100 :    
101 :     Now we ask for the representatives and the taxonomies.
102 :    
103 :     my $representativeHash = $sapServer->representative(ids => $genomeIDs);
104 :     my $taxonomyHash = $sapServer->taxonomy_of(ids => $genomeIDs,
105 :     format => 'names');
106 :    
107 :     Our data is now contained in a pair of hash tables. The following loop
108 :     stiches them together to produce the output.
109 :    
110 :     for my $genomeID (@$genomeIDs) {
111 :     my $repID = $representativeHash->{$genomeID};
112 :     my $taxonomy = $taxonomyHash->{$genomeID};
113 :     # Format the taxonomy string.
114 :     my $taxonomyString = join(" ", @$taxonomy);
115 :     # Print the result.
116 :     print join("\t", $genomeID, $repID, $taxonomyString) . "\n";
117 :     }
118 :    
119 : parrello 1.4 An excerpt from the output of this script is shown below. The first column contains
120 :     a genome ID, the second contains the representative genome's ID, and the third is
121 :     the full taxonomy. Note that the two genomes with very close taxonomies have the
122 :     same representative genome: this is the expected vehavior.
123 :    
124 :     221109.1 221109.1 Bacteria Firmicutes Bacilli Bacillales Bacillaceae Oceanobacillus Oceanobacillus iheyensis Oceanobacillus iheyensis HTE831
125 :     204722.1 204722.1 Bacteria Proteobacteria Alphaproteobacteria Rhizobiales Brucellaceae Brucella Brucella suis Brucella suis 1330
126 :     391037.3 369723.3 Bacteria Actinobacteria Actinobacteria (class) Actinobacteridae Actinomycetales Micromonosporineae Micromonosporaceae Salinispora Salinispora arenicola Salinispora arenicola CNS205
127 :     339670.3 216591.1 Bacteria Proteobacteria Betaproteobacteria Burkholderiales Burkholderiaceae Burkholderia Burkholderia cepacia complex Burkholderia cepacia Burkholderia cepacia AMMD
128 :     272560.3 216591.1 Bacteria Proteobacteria Betaproteobacteria Burkholderiales Burkholderiaceae Burkholderia pseudomallei group Burkholderia pseudomallei Burkholderia pseudomallei K96243
129 :     262768.1 262768.1 Bacteria Firmicutes Mollicutes Acholeplasmatales Acholeplasmataceae Candidatus Phytoplasma Candidatus Phytoplasma asteris Onion yellows phytoplasma Onion yellows phytoplasma OY-M
130 :     272624.3 272624.3 Bacteria Proteobacteria Gammaproteobacteria Legionellales Legionellaceae Legionella Legionella pneumophila Legionella pneumophila subsp. pneumophila Legionella pneumophila subsp. pneumophila str. Philadelphia 1
131 :     150340.3 150340.3 Bacteria Proteobacteria Gammaproteobacteria Vibrionales Vibrionaceae Vibrio Vibrio sp. Ex25
132 :     205914.1 205914.1 Bacteria Proteobacteria Gammaproteobacteria Pasteurellales Pasteurellaceae Histophilus Histophilus somni Haemophilus somnus 129PT
133 :     393117.3 169963.1 Bacteria Firmicutes Bacilli Bacillales Listeriaceae Listeria Listeria monocytogenes Listeria monocytogenes FSL J1-194
134 :     203119.1 203119.1 Bacteria Firmicutes Clostridia Clostridiales Clostridiaceae Clostridium Clostridium thermocellum Clostridium thermocellum ATCC 27405
135 :     380703.5 380703.5 Bacteria Proteobacteria Gammaproteobacteria Aeromonadales Aeromonadaceae Aeromonas Aeromonas hydrophila Aeromonas hydrophila subsp. hydrophila Aeromonas hydrophila subsp. hydrophila ATCC 7966
136 :     259536.4 259536.4 Bacteria Proteobacteria Gammaproteobacteria Pseudomonadales Moraxellaceae Psychrobacter Psychrobacter arcticus Psychrobacter arcticus 273-4
137 :    
138 : parrello 1.1 The Sapling Server processes lists of data (in this case a list of genome IDs)
139 :     so that you can minimize the overhead of sending requests across the web. You
140 :     can, however, specify a single ID instead of a list to a method call, and
141 :     this would allow you to structure your program with a more traditional loop, as
142 :     follows. To make this process simpler, you construct the Sapling Server object
143 :     in I<singleton mode>. In singleton mode, when you pass in only a single ID,
144 :     you get back an actual result instead of a hash reference.
145 :    
146 :     use strict;
147 :     use SAPserver;
148 :    
149 :     my $sapServer = SAPserver->new(singleton => 1);
150 :     my $genomeIDs = $sapServer->all_genomes(complete => 1);
151 :     $genomeIDs = [ keys %$genomeIDs ];
152 :     for my $genomeID (@$genomeIDs) {
153 :     my $repID = $sapServer->representative(ids => $genomeID);
154 :     my $taxonomy = $sapServer->taxonomy_of(ids => $genomeID,
155 :     format => 'names');
156 :     # Format the taxonomy string.
157 :     my $taxonomyString = join(" ", @$taxonomy);
158 :     # Print the result.
159 :     print join("\t", $genomeID, $repID, $taxonomyString) . "\n";
160 :     }
161 :    
162 :     At this point there is a risk that you are bewildered by all the options we've
163 :     presented-- hashes, hash references, singleton mode-- however, the goal here is
164 :     to provide a facility that will fit comfortably with different programming
165 :     styles. The server software tries to figure out how you want to use it and
166 :     adjusts accordingly.
167 :    
168 :     =head2 Specifying Gene IDs
169 :    
170 :     Many of the Sapling Server services return data on genes (a term we use rather
171 :     loosely to include any kind of genetic I<locus> or C<feature>). The standard
172 :     method for identifying a gene is the I<FIG ID>, an identifying string that
173 :     begins with the characters C<fig|> and includes the genome ID, the gene type,
174 :     and an additional number for uniqueness. For example, the FIG ID
175 :     C<fig|272558.1.peg.203> describes a protein encoding gene (I<peg>) in
176 :     Bacillus halodurans C-125 (I<272558.1>).
177 :    
178 :     Frequently, however, you will have a list of gene IDs from some other
179 : parrello 1.5 database (e.g. I<NCBI> (L<http://www.ncbi.nlm.nih.gov>), I<UniProt> (L<http://www.uniprot.org>))
180 : parrello 1.1 or in a community format such as Locus Tags or gene names. Most services that
181 :     take gene IDs as input allow you to specify a C<source> option that indicates
182 :     the type of IDs being used. The acceptable formats are as follows.
183 :    
184 :     =over 4
185 :    
186 :     =item CMR
187 :    
188 : parrello 1.5 I<Comprehensive Microbial Resource> (L<http://cmr.jcvi.org>) ID. The CMR IDs for
189 : parrello 1.1 C<fig|272558.1.peg.203> are C<10172815> and C<NTL01BH0204>.
190 :    
191 :     =item GENE
192 :    
193 :     Common Gene name. Often, these correspond to a large number of specified genes.
194 :     For example, C<accD>, which identifies the Acetyl-coenzyme A carboxyl
195 :     transferase beta chain, corresponds to 58 specific genes in the database.
196 :    
197 :     =item GeneID
198 :    
199 :     Common gene number. The common gene number for C<fig|272558.1.peg.203> is
200 :     C<891745>.
201 :    
202 :     =item KEGG
203 :    
204 : parrello 1.5 I<Kyoto Encyclopedia of Genes and Genomes> (L<http://www.genome.jp/kegg>) identifier.
205 : parrello 1.1 For example, the KEGG identifier for C<fig|158878.1.peg.2821> is C<sav:SAV2628>.
206 :    
207 :     =item LocusTag
208 :    
209 :     Common locus tag. For example, the locus tag of C<fig|380703.5.peg.250> is
210 :     C<ABK38410.1>.
211 :    
212 :     =item NCBI
213 :    
214 : parrello 1.5 I<NCBI> (L<http://www.ncbi.nlm.nih.gov>) accession number. The accession numbers for
215 : parrello 1.1 C<fig|272558.1.peg.203> are C<10172815>, C<15612766>, and C<81788207>.
216 :    
217 :     =item RefSeq
218 :    
219 : parrello 1.5 I<NCBI> (L<http://www.ncbi.nlm.nih.gov>) reference sequence identifier. The RefSeq
220 : parrello 1.1 identifier for C<fig|272558.1.peg.203> is C<NP_241069.1>.
221 :    
222 :     =item SEED
223 :    
224 :     FIG identifier. This is the default option.
225 :    
226 :     =item SwissProt
227 :    
228 : parrello 1.5 I<SwissProt> (L<http://www.expasy.ch/sprot>) identifier. For example, the SwissProt
229 : parrello 1.1 identifier for C<fig|243277.1.peg.3952> is C<O31153>.
230 :    
231 :     =item UniProt
232 :    
233 : parrello 1.5 I<UniProt> (L<http://www.uniprot.org>) identifier. The UniProt identifiers for
234 : parrello 1.1 C<fig|272558.1.peg.203> are C<Q9KGA9> and C<Q9KGA9_BACHD>.
235 :    
236 :     =back
237 :    
238 :     You can also mix identifiers of different types by specifying C<mixed>
239 :     for the source type. In this case, however, care must be taken, because the same
240 :     identifier can have different meanings in different databases.
241 :    
242 :     =head2 Normal Mode Examples
243 :    
244 :     The following examples use the Sapling Server in normal mode: that is, data
245 :     is sent to the server in batches and the results are stitched together
246 :     afterward. In this mode there is significantly reduced overhead, but there is
247 :     also a risk that the server request might time out. If this happens, you may
248 :     want to consider breaking the input into smaller batches. At some point, the
249 :     server system will perform sophisticated flow control to reduce the risk of
250 :     timeout errors, but we are not yet at that point.
251 :    
252 :     =head3 Retrieving Functional Roles
253 :    
254 :     There are two primary methods for retrieving functional roles.
255 :    
256 :     =over 4
257 :    
258 :     =item *
259 :    
260 :     L<SAP/ids_to_functions> returns the current functional assignment of a gene.
261 :    
262 :     =item *
263 :    
264 :     L<SAP/ids_to_subsystems> returns the subsystems and roles of a gene.
265 :    
266 :     =back
267 :    
268 :     In both cases, the list of incoming gene IDs is given as a list via the C<ids>
269 :     parameter. It is assumed the IDs are FIG identifiers, but the C<source> parameter
270 :     can be used to specify a different ID type (see L</Specifying Gene IDs>).
271 :    
272 :     B<ids_to_functions> provides the basic functional role. Almost every gene in the
273 :     system will return a result with this method. The following example program
274 :     reads a file of UniProt IDs and produces their functional roles. Note that
275 :     we're assuming the file is a manageable size: since we're submitting the entire
276 :     file at once, we risk a timeout error if it's too big.
277 :    
278 :     use strict;
279 :     use SAPserver;
280 :    
281 :     my $sapServer = SAPserver->new();
282 :     # Read all the gene IDs.
283 :     my @genes = map { chomp; $_ } <STDIN>;
284 :     # Compute the functional roles.
285 :     my $results = $sapServer->ids_to_functions(ids => \@genes, source => 'UniProt');
286 :     # Loop through the genes.
287 :     for my $gene (@genes) {
288 :     # Did we find a result?
289 :     my $role = $results->{$gene};
290 :     if (defined $role) {
291 :     # Yes, print it.
292 :     print "$gene\t$role\n";
293 :     } else {
294 :     # No, emit a warning.
295 :     print STDERR "$gene was not found.\n";
296 :     }
297 :     }
298 :    
299 : parrello 1.4 Sample output from this script is shown below. Note that one of the input IDs was
300 :     not found.
301 :    
302 :     HYPA_ECO57 [NiFe] hydrogenase nickel incorporation protein HypA
303 :     17KD_RICBR rickettsial 17 kDa surface antigen precursor
304 :     18KD_MYCLE 18 KDA antigen (HSP 16.7)
305 :     P72620_SYNY3 [NiFe] hydrogenase metallocenter assembly protein HypD
306 :     1A14_ARATH 1-aminocyclopropane-1-carboxylate synthase 4 / ACC synthase 4 (ACS4) / identical to gi:940370 [GB:U23481]; go_function: 1-aminocyclopropane-1-carboxylate synthase activity [goid 0016847]; go_process: ethylene biosynthesis [goid 0009693]; go_process: response to auxin stimulus [goid 0009733]
307 :     Q2RXN5_RHORT [NiFe] hydrogenase metallocenter assembly protein HypE
308 :     O29118 Glutamate N-acetyltransferase (EC 2.3.1.35) / N-acetylglutamate synthase (EC 2.3.1.1)
309 :     Q8PZM3 tRNA nucleotidyltransferase (EC 2.7.7.25)
310 :    
311 :     Q8YY27_ANASP was not found.
312 :    
313 : parrello 1.1 B<ids_to_subsystems> returns roles in subsystems. Roles in subsystems have
314 :     several differences from general functional roles. A single gene may be in
315 :     multiple subsystems and may have multiple roles in a subsystem. In addition,
316 :     only half of the genes in the database are currently associated with subsystems.
317 :     As a result, instead of a single string per incoming gene, B<ids_to_subsystems>
318 :     returns a list. Each element of the list consists of the role name followed by
319 :     the subsystem name. This makes the processing of the results a little more
320 :     complex, because we have to iterate through the list. An empty list indicates
321 :     the gene is not in a subsystem (although it could also indicate the gene was
322 :     not found).
323 :    
324 :     use SAPserver;
325 :    
326 :     my $sapServer = SAPserver->new();
327 :     # Read all the gene IDs.
328 :     my @genes = map { chomp; $_ } <STDIN>;
329 :     # Compute the functional roles.
330 :     my $results = $sapServer->ids_to_subsystems(ids => \@genes, source => 'UniProt');
331 :     # Loop through the genes.
332 :     for my $gene (@genes) {
333 :     # Did we find a result?
334 :     my $roleData = $results->{$gene};
335 :     if (! @$roleData) {
336 :     # Not in a subsystem: emit a warning.
337 :     print STDERR "$gene is not in a subsystem.\n";
338 :     } else {
339 :     # Yes, print the entries.
340 :     for my $ssData (@$roleData) {
341 :     print "$gene\t$ssData->[0]\t($ssData->[1])\n"
342 :     }
343 :     }
344 :     }
345 :    
346 : parrello 1.4 Sample output from this script is shown below. In this case, four of the input IDs
347 :     failed to find a result; however, two of them (C<O29118> and C<Q8PZM3>) produced multiple
348 :     results.
349 :    
350 :     HYPA_ECO57 [NiFe] hydrogenase nickel incorporation protein HypA (NiFe hydrogenase maturation)
351 :     P72620_SYNY3 [NiFe] hydrogenase metallocenter assembly protein HypD (NiFe hydrogenase maturation)
352 :     Q2RXN5_RHORT [NiFe] hydrogenase metallocenter assembly protein HypE (NiFe hydrogenase maturation)
353 :     O29118 N-acetylglutamate synthase (EC 2.3.1.1) (Arginine Biosynthesis extended)
354 :     O29118 Glutamate N-acetyltransferase (EC 2.3.1.35) (Arginine Biosynthesis extended)
355 :     O29118 N-acetylglutamate synthase (EC 2.3.1.1) (Arginine Biosynthesis)
356 :     O29118 Glutamate N-acetyltransferase (EC 2.3.1.35) (Arginine Biosynthesis)
357 :     Q8PZM3 tRNA nucleotidyltransferase (EC 2.7.7.25) (CBSS-316057.3.peg.1294)
358 :     Q8PZM3 tRNA nucleotidyltransferase (EC 2.7.7.25) (tRNA nucleotidyltransferase)
359 :    
360 :     17KD_RICBR is not in a subsystem.
361 :     Q8YY27_ANASP is not in a subsystem.
362 :     18KD_MYCLE is not in a subsystem.
363 :     1A14_ARATH is not in a subsystem.
364 :    
365 : parrello 1.2 =head3 Genes in Subsystems for Genomes
366 : parrello 1.1
367 : parrello 1.2 This next example finds all genes in subsystems for a specific genome. We will
368 :     perform this operation in two phases. First, we will find the subsystems for
369 :     each genome, and then the genes for those subsystems. This requires two Sapling
370 :     Server functions.
371 : parrello 1.1
372 : parrello 1.2 =over 4
373 :    
374 :     =item *
375 :    
376 :     L<SAP/genomes_to_subsystems> returns a list of the subsystems for each
377 :     specified genome.
378 :    
379 :     =item *
380 :    
381 :     L<SAP/ids_in_subsystems> returns a list of the genes in each listed
382 :     subsystem for a specified genome.
383 :    
384 :     =back
385 :    
386 :     Our example program will loop through the genome IDs in an input file. For
387 :     each genome, it will call I<genomes_to_subsystems> to get the subsystem list,
388 :     and then feed the list to I<ids_in_subsystems> to get the result.
389 :    
390 :     L<SAP/ids_in_subsystems> returns gene IDs rather than taking them as input.
391 :     Like L<SAP/ids_to_subsystems> and L<SAP/ids_to_functions>, it takes a C<source>
392 :     parameter that indicates the type of ID desired (e.g. C<NCBI>, C<CMR>, C<LocusTag>).
393 :     In this case, however, the type describes how the gene IDs will be formatted in
394 :     the output rather than the type expected upon input. If a gene does not have an
395 :     ID for a particular source type (e.g. C<fig|100226.1.peg.3361> does not have a
396 :     locus tag), then the FIG identifier is used. The default source type (C<SEED>)
397 :     means that FIG identifiers will be used for everything.
398 :    
399 :     The program is given below.
400 :    
401 :     use strict;
402 :     use SAPserver;
403 :    
404 :     my $sapServer = SAPserver->new();
405 :     # Loop through the input file. Note that this loop will stop on the first
406 :     # blank line.
407 :     while (my $genomeID = <STDIN>) {
408 :     chomp $genomeID;
409 :     # Get the subsystems for this genome.
410 :     my $subHash = $sapServer->genomes_to_subsystems(ids => $genomeID);
411 :     # The data returned for each genome (and in our case there's only one)
412 :     # includes the subsystem name and the variant code. The following
413 :     # statement strips away the variant codes, leaving only the subsystem
414 :     # names.
415 :     my $subList = [ map { $_->[0] } @{$subHash->{$genomeID}} ];
416 :     # Ask for the genes in each subsystem, using NCBI identifiers.
417 :     my $roleHashes = $sapServer->ids_in_subsystems(subsystems => $subList,
418 :     genome => $genomeID,
419 :     source => 'NCBI',
420 :     roleForm => 'full');
421 :     # The hash maps each subsystem ID to a hash of roles to lists of feature
422 :     # IDs. We therefore use three levels of nested loops to produce our
423 :     # output lines. At the top level we have a hash mapping subsystem IDs
424 :     # to role hashes.
425 :     for my $subsystem (sort keys %$roleHashes) {
426 :     my $geneHash = $roleHashes->{$subsystem};
427 :     # The gene hash maps each role to a list of gene IDs.
428 :     for my $role (sort keys %$geneHash) {
429 :     my $geneList = $geneHash->{$role};
430 :     # Finally, we loop through the gene IDs.
431 :     for my $gene (@$geneList) {
432 : parrello 1.4 print "$genomeID\t$gene\t$subsystem\t$role\n";
433 : parrello 1.2 }
434 :     }
435 :     }
436 :     }
437 : parrello 1.1
438 : parrello 1.4 An excerpt of the output is shown here.
439 :    
440 :     360108.3 85840651 Queuosine-Archaeosine Biosynthesis Queuosine Biosynthesis QueC ATPase
441 :     360108.3 85841812 Queuosine-Archaeosine Biosynthesis Queuosine Biosynthesis QueE Radical SAM
442 :     360108.3 85841520 Queuosine-Archaeosine Biosynthesis Queuosine biosynthesis QueD, PTPS-I
443 :     360108.3 85841162 Queuosine-Archaeosine Biosynthesis S-adenosylmethionine:tRNA ribosyltransferase-isomerase (EC 5.-.-.-)
444 :     360108.3 85842244 Queuosine-Archaeosine Biosynthesis tRNA-guanine transglycosylase (EC 2.4.2.29)
445 :     360108.3 85841653 Quinate degradation 3-dehydroquinate dehydratase II (EC 4.2.1.10)
446 :     360108.3 85840760 RNA polymerase bacterial DNA-directed RNA polymerase alpha subunit (EC 2.7.7.6)
447 :     360108.3 85841269 RNA polymerase bacterial DNA-directed RNA polymerase beta subunit (EC 2.7.7.6)
448 :     360108.3 85841348 RNA polymerase bacterial DNA-directed RNA polymerase beta' subunit (EC 2.7.7.6)
449 :     360108.3 85841887 RNA polymerase bacterial DNA-directed RNA polymerase omega subunit (EC 2.7.7.6)
450 :     360108.3 85841283 RNA processing and degradation, bacterial 3'-to-5' exoribonuclease RNase R
451 :     360108.3 85840820 RNA processing and degradation, bacterial Ribonuclease III (EC 3.1.26.3)
452 :     360108.3 85842272 Recycling of Peptidoglycan Amino Acids Aminoacyl-histidine dipeptidase (Peptidase D) (EC 3.4.13.3)
453 :    
454 : parrello 1.3 The I<ids_in_subsystems> service has several useful options for changing the nature
455 :     of the output. For example, in the above program each role is represented by a
456 :     full description (C<roleForm> set to C<full>). If you don't need the roles, you
457 :     can specify C<none> for the role form. You can also request that the gene IDs
458 :     be returned in a comma-separated list instead of a list data structure. These
459 :     two changes can drastically simplify the above program.
460 :    
461 :     use strict;
462 :     use SAPserver;
463 :    
464 :     my $sapServer = SAPserver->new();
465 :     # Loop through the input file. Note that this loop will stop on the first
466 :     # blank line.
467 :     while (my $genomeID = <STDIN>) {
468 :     chomp $genomeID;
469 :     # Get the subsystems for this genome.
470 :     my $subHash = $sapServer->genomes_to_subsystems(ids => $genomeID);
471 :     # The data returned for each genome (and in our case there's only one)
472 :     # includes the subsystem name and the variant code. The following
473 :     # statement strips away the variant codes, leaving only the subsystem
474 :     # names.
475 :     my $subList = [ map { $_->[0] } @{$subHash->{$genomeID}} ];
476 :     # Ask for the genes in each subsystem, using NCBI identifiers.
477 :     my $genesHash = $sapServer->ids_in_subsystems(subsystems => $subList,
478 :     genome => $genomeID,
479 :     source => 'NCBI',
480 :     roleForm => 'none',
481 :     grouped => 1);
482 :     # The hash maps each subsystem ID to a comma-delimited list of gene IDs.
483 :     for my $subsystem (sort keys %$genesHash) {
484 :     my $genes = $genesHash->{$subsystem};
485 :     print "$genomeID\t$subsystem\t$genes\n";
486 :     }
487 :     }
488 :    
489 : parrello 1.4 The sample output in this case looks quite different. The role information is missing,
490 :     and all the data for a subsystem is in a single line.
491 :    
492 :     360108.3 Queuosine-Archaeosine Biosynthesis 85841622, 85841791, 85840661, 85841162, 85841520, 85842244, 85840651, 85841812
493 :     360108.3 Quinate degradation 85841653
494 :     360108.3 RNA polymerase bacterial 85840760, 85841269, 85841348, 85841887
495 :     360108.3 RNA processing and degradation, bacterial 85840820, 85841283
496 :     360108.3 Recycling of Peptidoglycan Amino Acids 85842019, 85842272
497 :    
498 : parrello 1.1 =cut
499 :    
500 :     1;

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3