[Bio] / FigTutorial / API.html Repository:
ViewVC logotype

Annotation of /FigTutorial/API.html

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1.4 - (view) (download) (as text)

1 : overbeek 1.1 <h1>Accessing SEED Services</h1>
2 :     <h2>Introduction</h2>
3 :     This document is designed to help a programmer understand how to
4 :     access the services provided by the SEED. It is not designed for the
5 :     user who wishes to confine his experience to working with the SEED
6 :     browser over the Web.
7 :     <p>
8 :     Before we launch into a discussion of the details of how to invoke
9 :     specific routines, there are some background issues that you should
10 :     know about. First, let's talk about <b>fig</b>, which is a
11 :     command-line utility that is used for a number of purposes. Its most
12 :     basic purpose is to exercise the API and illustrate what data is
13 :     accessed by each operation. However, you may also find it extremely
14 :     useful as a quick way to invoke services. To try it out, type
15 :     <pre>
16 :     fig
17 :     </pre>
18 :     This should produce a prompt of the form
19 :     <pre>
20 :     ??
21 :     </pre>
22 :     You should type
23 :     <pre>
24 :     ?? h&lt;cr&gt;
25 :     </pre>
26 :     to get a listing of the commands
27 :     supported by the utility.
28 :     At the time this document was written (May, 2004), the output was as follows:
29 :     <pre>
30 :     <b>?? h</b>
31 :     DB Current DB
32 :     abbrev genome_name
33 :     access_sims [expand|func|expandfunc] (for timeing - access 10 distinct sims)
34 :     add_annotation FID User [ prompted for annotation; terminate with "." at start of line ]
35 :     add_chromosomal_clusters File
36 :     add_genome GenomeDir
37 :     add_pch_pins File
38 :     add_peg_link PEG Link (e.g., &lt;a href=http//...&gt;link to somewhere&lt;/a&gt;)
39 :     all_compounds
40 :     all_features GenomeID Type
41 :     all_maps
42 :     all_protein_families
43 :     all_reactions
44 :     all_roles
45 :     all_sets Relation SetName
46 :     assign_function PEG User [conf=X] Function
47 :     assign_functionF User File [no_annotations]
48 :     assignments_made who date G1 G2 ... [ none => all ]
49 :     auto_assign PEG [Seq]
50 :     auto_assignF User FileOfPEGs
51 :     auto_assignG User GenomeID1 GenomeID2 ...
52 :     bbhs PEG Cutoff
53 :     between n1 n2 n3
54 :     blast PEG [against nr]
55 :     blastit PEG seq db maxP
56 :     boundaries_of Loc
57 :     build_tree_of_complete min_for_label
58 :     by_alias alias
59 :     by_fig_id FID1 FID2
60 :     candidates_for_role Genome Cutoff Role
61 :     cas CID
62 :     cas_to_cid cas
63 :     catalyzed_by RID
64 :     catalyzes role
65 :     cgi_url
66 :     clean_tmp
67 :     close_genes FID distance
68 :     comp2react CID
69 :     contig_ln GenomeID Contig
70 :     coupling_and_evidence FID Bound SimCutoff CouplingCutoff
71 :     crude_estimate_of_distance GenomeID1 GenomeID2
72 :     delete_all_peg_links PEG
73 :     delete_genomes G1 G2 G3 ...Gn
74 :     delete_peg_link PEG URL
75 :     displayable_reaction RID
76 :     dna_seq GenomeID Loc
77 :     dsims FastaFile [MaxN MaxPsc Select] *** select defaults to raw
78 :     ec_to_maps EC
79 :     ec_name EC
80 :     expand_ec EC
81 :     epoch_to_readable [gives readable version of current time]
82 :     export_chromosomal_clusters
83 :     export_pch_pins
84 :     export_set Relation SetName File
85 :     exportable_subsystem Subsystem
86 :     external_calls PEG1 PEG2 PEG3...
87 :     extract_seq ContigsFile loc
88 :     all_exchangable_subsystems
89 :     best_bbh_candidates Genome Cutoff Requested Known
90 :     family_function Family
91 :     fast_coupling FID Bound CouplingCutoff
92 :     feature_aliases FID
93 :     feature_annotations FID
94 :     feature_location FID
95 :     file2N File
96 :     ftype FID
97 :     function_of ID [all] [trans]
98 :     genes_in_region GenomeID Contig Beg End
99 :     genome_of PEG
100 :     genomes [complete|Pat]
101 :     genome_counts [complete]
102 :     genome_version GenomeID
103 :     genus_species GenomeID
104 :     get_translation ID
105 :     get_translations File [tab]
106 :     h [command] *** h h for help on how to use help ***
107 :     hypo Function
108 :     ids_in_family Family
109 :     ids_in_set WhichSet Relation SetName
110 :     in_cluster_with PEG
111 :     in_family FID
112 :     in_pch_pin_with FID
113 :     in_sets Which Relation SetName
114 :     is_archaeal GenomeID
115 :     is_bacterial GenomeID
116 :     is_eukaryotic GenomeID
117 :     is_prokaryotic GenomeID
118 :     is_exchangable_subsystem Subsystem
119 :     is_real_feature FID
120 :     largest_clusters FileOfRoles user
121 :     load_all
122 :     map_to_ecs map
123 :     map_name map
124 :     mapped_prot_ids ID
125 :     maps_to_id ID
126 :     max n1 n2 n3 ...
127 :     merged_related_annotations PEG1 PEG2 ... PEGn
128 :     min n1 n2 n3 ...
129 :     names_of_compound
130 :     neighborhood_of_role role
131 :     org_of prot_id
132 :     pegs_of GenomeID
133 :     possibly_truncated FID
134 :     reaction2comp RID
135 :     related_by_func_sim PEG user
136 :     reversible RID
137 :     rnas_of GenomeID
138 :     roles_of_function function
139 :     same_func 'function1' 'function2'
140 :     search_index pattern
141 :     seqs_with_role role who
142 :     seqs_with_roles_in_genomes genomes roles made_by
143 :     sims PEG maxN maxP select
144 :     sort_fids_by_taxonomy FID1 FID2 FID3 ...
145 :     sort_genomes_by_taxonomy GenomeID1 GenomeID2 GenomeID3 ...
146 :     sz_family family
147 :     taxonomic_groups_of_complete min_for_labels
148 :     taxonomy_of GenomeID
149 :     translatable prot_id
150 :     translate_function PEG user
151 :     translated_function_of PEG user
152 :     translation_length ID
153 :     unique_functions PEGs user [make the pegs comma separated]
154 :     verify_dir dir
155 :    
156 :     </pre>
157 :     If you quickly browse through the commands that are supported by <b>fig</b>,
158 :     you should get a fairly complete overview of the services provided by the API.
159 :     We are not going to cover the use of <b>fig</b> in detail in this
160 :     tutorial, but we do encourage you to experiment with it. By the way,
161 :     you should type <b>x</b> to exit from it.
162 :     <p>
163 :     <b>fig</b> is ususally accessed in interactive mode (invoked with no
164 :     arguments). However, if you wish to invoke it to process a single
165 :     command, you can just put the command as a single argument on the
166 :     command-line. Thus,
167 :     <pre>
168 :     fig 'add_peg_link fig|188.1.peg.96 &lt;a href=http://nature.berkeley.edu/~hongwang/Project/ATP_synthase/&gt;animation&lt;/a&gt;'
169 :     </pre>
170 :     would add a link to PEG <i>fig|188.1.peg.96</i> while
171 :     <pre>
172 :     fig 'delete_all_peg_links fig|188.1.peg.96'
173 :     </pre>
174 :     would delete all links from the PEG. Since adding links to PEGs is
175 :     something you might wish to do, the easiest way to do it is to just
176 :     create a file of commands of the sort shown, make the file executable,
177 :     and then run it.
178 :     <p>
179 :     If you are a Perl programmer, eventually you will wish to peruse the
180 :     <b>fig</b> source code to see how it is structured, how to add to it,
181 :     and how to invoke services via the API.
182 :     <p>
183 :     We have briefly discussed the utility <b>fig</b>. Now it is time to
184 :     discuss writing your own programs using the API. Basically, the
185 :     entire API is encapsulated in the Perl module <b>FIG.pm</b>. To illustrate
186 :     the style of programming we recommend, let's create a simple program
187 :     to add a link to a PEG (rather than using <b>fig</b>).
188 :     Adding a link
189 :     to a PEG will cause the link to be displayed whenever anyone examines
190 :     the PEG via the SEED web services..
191 :     The API defines a set of services that do many things; adding a link
192 :     to a PEG is just one very minor service, but understanding exactly how one
193 :     could invoke such a service is what this tutorial is about.
194 :     To begin, use
195 :     <pre>
196 :     tool_hdr add_link
197 :     </pre>
198 :     This creates a file called <i>add_link</i> and makes it executable.
199 :     The file contains an initial set of Perl commands designed to
200 :     establish the correct execution environment (where the libraries are,
201 :     and so forth). You need to add your code to the end of the
202 :     initialized file. We recommend adding the following lines:
203 :     <hr>
204 :     <pre>
205 :     use FIG;
206 :     my $fig = new FIG;
207 :    
208 :     $usage = "usage: add_link PEG Link";
209 :    
210 :     (
211 :     ($peg = shift @ARGV) &&
212 :     ($link = shift @ARGV)
213 :     )
214 :     || die $usage;
215 :    
216 :     $fig->add_peg_links([$peg,$link]);
217 :     </pre>
218 :     <h3>A short program to add a link to a PEG</h3>
219 :     <hr>
220 :     We urge you to study this short program carefully, since it reflects
221 :     the style we will use throughout this tutorial. There are a number of
222 :     things worth noting:
223 :     <ul>
224 :     <li>
225 :     The lines
226 :     <pre>
227 :     use FIG;
228 :     my $fig = new FIG;
229 :     </pre>
230 :     initialize <i>$fig</i>. This is your key to the fig services. All
231 :     API services are invoked using
232 :     <pre>
233 :     $fig->rtn(arguments)
234 :     </pre>
235 :     <li>
236 :     The last line of the program is the actual invocation of the service
237 :     to add a link to a peg. Actually, <i>add_peg_links</i> is a service
238 :     routine that takes any number of arguments. Each argument is a
239 :     2-tuple (i.e., a pointer to a list containing two elements). Each
240 :     2-tuple contains a PEG and a link. Links should be of the form:
241 :     <pre>
242 :     &lt;a href=URL&gt;LABEL&lt;/a&gt;
243 :     </pre>
244 :     </ul>
245 : olson 1.2
246 :     We suggest that you type in this program and get it to run on your
247 : overbeek 1.1 version of the SEED. Then, use <b>fig</b> to delete any links you
248 :     added.
249 : olson 1.2
250 : overbeek 1.1 <h2>Accessing Data Relating to Genomes</h2>
251 : olson 1.2
252 : overbeek 1.1 Let us now begin a somewhat more systematic exploration of the services offered by the API.
253 :     We will begin with genome-related services, of which there are relatively few in the SEED.
254 :     Essentially, genomes are represented by IDs of the form <i>xxx.y</i>
255 :     where <i>xxx</i> is the NCBI taxon ID, and <i>y</i> is a suffix to
256 :     disambiguate situations in which you have multiple genomes with the
257 :     same taxon ID.
258 : olson 1.2 <p>
259 :     <blockquote>
260 :     <table border="1">
261 :     <tr>
262 :     <td>genomes.ex.pl</td>
263 :     <td>genomes.ex.py</td>
264 :     </tr>
265 :     <tr>
266 :     <td bgcolor="f0f0f0"><pre>
267 : overbeek 1.1
268 :     use strict;
269 :     use FIG;
270 :     my $fig = new FIG;
271 :    
272 :     my(%sofar,$genome,$which,$abbrev,$version);
273 :    
274 :     foreach $genome ($fig->genomes("complete"))
275 :     {
276 :     $which = $fig->genus_species($genome);
277 :     $abbrev = &get_unique_abbreviation($which,\%sofar);
278 :     $version = $fig->genome_version($genome);
279 :     print "$genome\t$abbrev\t$version\t$which\n";
280 :     }
281 :    
282 :     sub get_unique_abbreviation {
283 :     my($which,$sofar) = @_;
284 :     my($nonunique,$n);
285 :    
286 :     $nonunique = &FIG::abbrev($which);
287 :     $nonunique =~ s/ //g;
288 :     $nonunique =~ s/\.+$//;
289 :     $n = $sofar->{$nonunique};
290 :     if (! defined($n))
291 :     {
292 :     $n = $sofar->{$nonunique} = 1;
293 :     }
294 :     $sofar->{$nonunique}++;
295 :     return $nonunique . ".$n";
296 :     }
297 : olson 1.2 </pre></td>
298 :     <td bgcolor="f0f0f0"><pre>
299 : golsen 1.4 from FigKernelPackages import FIG
300 :    
301 :     import sys
302 :    
303 :     import string
304 :    
305 :    
306 :    
307 :    
308 :    
309 :     def die(msg):
310 :    
311 :     print msg
312 :    
313 :     sys.exit(0)
314 :    
315 :    
316 :    
317 :     fig = FIG.FIG()
318 :    
319 :     sofar = {}
320 :    
321 :    
322 :    
323 :     def get_unique_abbreviation(which, sofar):
324 :    
325 :     nonunique = fig.abbrev(which);
326 :    
327 :     nonunique = string.rstrip(string.replace(nonunique," ", ""), ".")
328 :    
329 :     if not sofar.has_key(nonunique):
330 :    
331 :     sofar[nonunique] = 1
332 :    
333 :     n = sofar[nonunique]
334 :    
335 :     sofar[nonunique] = sofar[nonunique] + 1
336 :    
337 :     return (nonunique+"."+str(n))
338 :    
339 :    
340 :    
341 :     for genome in fig.genomes("complete"):
342 :    
343 :     which = fig.genus_species(genome)
344 :    
345 :     abbrev = get_unique_abbreviation(which,sofar);
346 :    
347 :     version = fig.genome_version(genome);
348 :    
349 :     print "%s\t%s\t%s\t%s" % (genome, abbrev, version, which)
350 :    
351 :    
352 :    
353 : olson 1.2 </pre></td>
354 :     </tr>
355 :     </table>
356 :     </blockquote>
357 :    
358 : overbeek 1.1 <h3>An illustration of genome-related services</h3>
359 : olson 1.2
360 : overbeek 1.1 <hr>
361 :     There are a few points worth noting about the short example:
362 :     <ul>
363 :     <li>
364 : olson 1.2
365 : overbeek 1.1 <i>$fig->genomes</i> returns a list of all of the genome IDs. It has
366 :     two arguments that can be omitted. The first is used to request only
367 :     complete (or near complete) genomes (this is used in the example).
368 :     For each genome in the SEED there is a directory in
369 :     <i>$FIG_Config::organisms</i> that encapsulates data about the genome.
370 :     Genomes will be considered "complete" iff there exists a file
371 :     <i>$FIG_Config::organisms/$genome/COMPLETE</i> (where $genome contains the
372 :     genome ID). The second argument can be used to restrict the returned
373 :     list of genomes to those that have restrictions on distribution
374 :     (indicated by the presence of
375 :     <i>$FIG_Config::organisms/$genome/RESTRICTIONS</i>). Thus,
376 :     <i>$fig->genomes("complete","restricted")</i> could be used to get a
377 :     list of complete, but restricted, genomes.
378 :     <li>
379 :     There are a number of routines contained in <i>FIG.pm</i> which are
380 :     not really services, but are there because they are simply offer
381 :     convenient subroutines. They probably should not be part of an API.
382 :     For example <i>&FIG::between($x,$y,$z)</i> can be invoked to return
383 :     "true" iff <i>$y</i> is between <i>$x</i> and <i>$z</i>.
384 :     <i>&FIG::abbrev($which)</i> invokes a simple routine that produces an
385 :     abbrevbiation of a genome description. It will usually contain
386 :     spaces and sometimes trailing periods, so if you want an abbreviation without these, you have to get
387 :     rid of them. Further, the abbreviations are not unique (that is,
388 :     invoking this routine for two distinct genomes can produce the same
389 :     result). Hence, we give a simple little routine that produces unique
390 :     abbreviations (suitable, for example, to use for sequences in an
391 :     alignment).
392 :     <li>
393 :     Finally, it is worth noting that, as of now, we think of a genome
394 :     "version" as composed of a genome ID concatenated with a checksum.
395 :     This may or may not change in the future.
396 :     </ul>
397 :     Here are just a few lines from the output of this little program:
398 :     <pre>
399 :     197221.1 The.elon.BP.1 197221.1.3406724106 Thermosynechococcus elongatus BP-1
400 :     198094.1 Bac.anth.st.2 198094.1.2593352402 Bacillus anthracis str. Ames
401 :     198214.1 Shi.flex.2a.1 198214.1.1464268764 Shigella flexneri 2a str. 301
402 :     198215.1 Shi.flex.2a.2 198215.1.1756034760 Shigella flexneri 2a str. 2457T
403 :     </pre>
404 :     What other genome-related services does the SEED offer? The only ones
405 :     worth mentioning here relate to <i>contigs</i>. Here is a short
406 :     example illustrating how to use those services. Basically, you can
407 :     <ol>
408 :     <li>
409 :     get a list of the contigs in a given genome,
410 :     <li>
411 :     get the length of a specified contig, and
412 :     <li>
413 :     extract sequence from a contig.
414 :     </ol>
415 :     To illustrate these features, here is a short program that list all of
416 :     the contigs for a designated genome. For each contig that is over 50
417 :     characters in size (and, they all should be), it displays the contig
418 :     ID, the contig length, and the first 50 characters of sequence.
419 : olson 1.2 <p>
420 :     <blockquote>
421 :     <table border="1">
422 :     <tr>
423 :     <td>contigs_for_genome.pl</td>
424 :     <td>contigs_for_genome.py</td>
425 :     </tr>
426 :     <tr>
427 :     <td bgcolor="f0f0f0"><pre>
428 : overbeek 1.1
429 :     use strict;
430 :     use FIG;
431 :     my $fig = new FIG;
432 :    
433 :     my $usage = "usage: contigs_for_genome Genome";
434 :    
435 :     my($genome,@contigs,$contig,$len,$seq);
436 :    
437 :     ($genome = shift @ARGV)
438 :     || die $usage;
439 :    
440 :     @contigs = $fig->all_contigs($genome);
441 :     foreach $contig (@contigs)
442 :     {
443 :     $len = $fig->contig_ln($genome,$contig);
444 :     if ($len >= 50)
445 :     {
446 :     $seq = $fig->dna_seq($genome,$contig . "_1_50");
447 :     print "$contig\t$len\t$seq\n";
448 :     }
449 :     }
450 :    
451 : olson 1.2 </pre></td>
452 :     <td bgcolor="f0f0f0"><pre>
453 : golsen 1.4 from FigKernelPackages import FIG
454 :    
455 :     import sys
456 :    
457 :    
458 :    
459 :     fig = FIG.FIG()
460 :    
461 :    
462 :    
463 :     def die(msg):
464 :    
465 :     print msg
466 :    
467 :     sys.exit(0)
468 :    
469 :    
470 :    
471 :     if len(sys.argv) != 2:
472 :    
473 :     die("usage: contigs_for_genome Genome")
474 :    
475 :    
476 :    
477 :     genome = sys.argv[1]
478 :    
479 :    
480 :    
481 :     contigs = fig.all_contigs(genome)
482 :    
483 :    
484 :    
485 :     for contig in contigs:
486 :    
487 :     len = fig.contig_ln(genome,contig)
488 :    
489 :     if len >= 50:
490 :    
491 :     seq = fig.dna_seq(genome,contig+"_1_50")
492 :    
493 :     print contig+"\t"+len+"\t"+seq
494 :    
495 :    
496 :    
497 : olson 1.2 </pre></td>
498 :     </tr>
499 :     </table>
500 :     </blockquote>
501 :    
502 : overbeek 1.1 <h3>How to access services related to contigs</h3>
503 :     <hr>
504 :     The services being used in this short example are as follows:
505 :     <ul>
506 :     <li>
507 :     <i>$fig->all_contigs($genome)</i> returns a list of the contigs that
508 :     make up the designated genome,
509 :     <li>
510 :     <i>$fig->contig_ln($genome,$contig)</i> returns the length of the
511 :     designated contig, and
512 :     <li>
513 :     <i>$fig->dna_seq($genome,$contig . "_1_50")</i> returns the first 50
514 :     characters of the contig.
515 :     </ul>
516 :     This last service (<i>dna_seq</i>) requires some explanation. You
517 :     need to know that
518 :     <ol>
519 :     <li>
520 :     Locations on a contig begin with 1 (not 0 as some computer scientists
521 :     would prefer).
522 :     <li>
523 :     Within the SEED, locations are represented as a comma-separated list
524 :     of <i>contiguous regions</i>. A contiguous region is represented by a
525 :     contig ID, a beginning
526 :     position, and an ending position all separated by underscores. One
527 :     fact that can be a little confusing is that contig IDs themselves may
528 :     contain underscores. Thus,
529 :     <pre>
530 :     NC_000913_1_50
531 :     </pre>
532 :     would be used to represent the first fifty chracters of the contig
533 :     NC_000913.
534 :     <li>
535 :     Sections of sequence on the complementary strand are represented
536 :     similarly (with the beginning point greater than the ending
537 :     position). Thus,
538 :     <pre>
539 :     NC_000913_50_1
540 :     </pre>
541 :     would represent the sequence on the complementary strand of NC_000913
542 :     going from position 50 to position 1.
543 :     <li>
544 :     The routine <i>dna_seq</i> will accept a location (i.e., a
545 :     comma-separated list of contiguous regions), or even just a list of
546 :     arguments each representing a contiguous region, and return
547 :     the concatenation of the sequences from those locations. Thus,
548 :     <pre>
549 :     $fig->dna_seq($genome,"NC_000913_1_50,NC_000913_52_100")
550 :     or
551 :     $fig->dna_seq($genome,"NC_000913_1_50","NC_000913_52_100")
552 :     </pre>
553 :     would return 99 characters formed by deleting character 51 from the
554 :     first 100 characters of NC_000913.
555 :     </ol>
556 :    
557 :     <h2>Accessing Data Relating to Features</h2>
558 :    
559 :     The SEED was designed, like WIT and ERGO before it, to allow a variety
560 :     of types of features. In fact, the SEED is essentially focused on
561 :     protein-encoding genes, which I perversely named <i>pegs</i> rather
562 :     than <i>CDSs</i>. It is true that WIT contained RNAs, inteins,
563 :     introns, and repeat sequences, and ERGO contained RNAs and IS
564 :     elements; however, the SEED really only has pegs and RNAs at this
565 :     point (and, the RNAs are not adequately handled).
566 :     <p>
567 :     There is a fairly rich set of services for pegs, as you might guess
568 :     from the web services offered by the SEED. As an introduction to some
569 :     of these capabilities, consider the following short program:
570 :    
571 : olson 1.2 <p>
572 :     <blockquote>
573 :     <table border="1">
574 :     <tr>
575 :     <td>features_around.pl</td>
576 :     <td>features_around.py</td>
577 :     </tr>
578 :     <tr>
579 :     <td bgcolor="f0f0f0"><pre>
580 :    
581 : overbeek 1.1 use strict;
582 :     my($usage,$id,$fid,$peg,$loc,$contig,$beg,$end,$start_reg,$end_reg);
583 :     my($loc1,$aliases1,$trunc,$pseq,$prot_ln,$func,$features_in_region);
584 :     my($start_of_leftmost_feature,$end_of_rightmost_feature);
585 :     my($genome);
586 :    
587 :     use FIG;
588 :     my $fig = new FIG;
589 :    
590 :     $usage = "usage: features_around ID";
591 :    
592 :     ($id = shift @ARGV)
593 :     || die $usage;
594 :    
595 :     if ($peg = $fig->by_alias($id))
596 :     {
597 :     if ($loc = $fig->feature_location($peg))
598 :     {
599 :     ($contig,$beg,$end) = $fig->boundaries_of($loc);
600 :     if (defined($contig))
601 :     {
602 :     $start_reg = &FIG::min($beg,$end) - 5000;
603 :     $end_reg = &FIG::max($beg,$end) + 5000;
604 :     $genome = &FIG::genome_of($peg);
605 :     ($features_in_region,$start_of_leftmost_feature,$end_of_rightmost_feature) =
606 :     $fig->genes_in_region($genome,$contig,$start_reg,$end_reg);
607 :     foreach $fid (@$features_in_region)
608 :     {
609 :     $loc1 = $fig->feature_location($fid);
610 :     $aliases1 = $fig->feature_aliases($fid);
611 :     $trunc = $fig->possibly_truncated($fid);
612 :    
613 :     $pseq = $func = $prot_ln = "";
614 :    
615 :     if (&FIG::ftype($fid) eq "peg")
616 :     {
617 :     if ($prot_ln = $fig->translation_length($fid))
618 :     {
619 :     $pseq = $fig->get_translation($fid);
620 :     $func = $fig->function_of($fid);
621 :     }
622 :     }
623 :     print join("\t",($fid,$loc1,$aliases1,$trunc,$func,$prot_ln,$pseq)),"\n";
624 :     }
625 :     }
626 :     }
627 :     else
628 :     {
629 :     print STDERR "Sorry, could not get the location of $fid\n";
630 :     }
631 :     }
632 :     else
633 :     {
634 :     print STDERR "Sorry, could not figure out which PEG you meant by $id\n";
635 :     }
636 : olson 1.2 </pre></td>
637 :     <td bgcolor="f0f0f0"><pre>
638 : golsen 1.4 from FigKernelPackages import FIG
639 :    
640 :     import sys
641 :    
642 :     import string
643 :    
644 :    
645 :    
646 :     fig = FIG.FIG()
647 :    
648 :    
649 :    
650 :     def die(msg):
651 :    
652 :     print msg
653 :    
654 :     sys.exit(0)
655 :    
656 :    
657 :    
658 :     if len(sys.argv) != 2:
659 :    
660 :     die("usage: features_around ID")
661 :    
662 :    
663 :    
664 :     id = sys.argv[1]
665 :    
666 :    
667 :    
668 :     peg = fig.by_alias(id)
669 :    
670 :     if peg:
671 :    
672 :     loc = fig.feature_location(peg)
673 :    
674 :     if loc:
675 :    
676 :     try:
677 :    
678 :     (contig,beg,end) = fig.boundaries_of(loc)
679 :    
680 :     start_reg = int(fig.min(beg,end)) - 5000
681 :    
682 :     end_reg = int(fig.max(beg,end)) + 5000
683 :    
684 :     genome = fig.genome_of(peg)
685 :    
686 :     (features_in_region,start_of_leftmost_feature,end_of_rightmost_feature) = fig.genes_in_region(genome,contig,start_reg,end_reg);
687 :    
688 :     for fid in features_in_region:
689 :    
690 :     loc1 = fig.feature_location(fid)
691 :    
692 :     aliases1 = fig.feature_aliases(fid)
693 :    
694 :     trunc = fig.possibly_truncated(fid)
695 :    
696 :    
697 :    
698 :     pseq = func = prot_ln = ""
699 :    
700 :    
701 :    
702 :     if fig.ftype(fid) == "peg":
703 :    
704 :     prot_ln = fig.translation_length(fid)
705 :    
706 :     if prot_ln:
707 :    
708 :     pseq = fig.get_translation(fid)
709 :    
710 :     func = fig.function_of(fid)
711 :    
712 :     print string.join([fid, loc1, str(aliases1), str(trunc), str(func), str(prot_ln), pseq], "\t")
713 :    
714 :     #print join("\t",($fid,$loc1,$aliases1,$trunc,$func,$prot_ln,$pseq)),"\n";
715 :    
716 :     except:
717 :    
718 :     pass
719 :    
720 :     else:
721 :    
722 :     sys.stderr.write("Sorry, could not get the location of %s\n" % fid)
723 :    
724 :     #print STDERR "Sorry, could not get the location of $fid\n";
725 :    
726 :     else:
727 :    
728 :     sys.stderr.write("Sorry, could not figure out which PEG you meant by %s\n" % id)
729 :    
730 :     #print STDERR "Sorry, could not figure out which PEG you meant by $id\n";
731 :    
732 : olson 1.2 </pre></td>
733 :     </tr>
734 :     </table>
735 :     </blockquote>
736 : overbeek 1.1 <h3>Displaying features close to a given feature</h3>
737 :     <hr>
738 :     This short program takes as an argument an ID that corresponds to a
739 :     feature. The ID can be a FIG feature ID of the form
740 :     <pre>
741 :     fig|xxxx.y.peg.zzzz
742 :     </pre>
743 :     where <i>xxxx.y</i> is the genome ID and <i>zzzz</i> is a unique
744 :     sequence number (these usually occur in the same order as the genes
745 :     occur on the contigs, but this is not guaranteed to be true). The
746 :     argument could also be any other ID known by the SEED that uniquely
747 :     corresponds to the FIG ID. For example,
748 :     <pre>
749 :     features_around 'sp|Q9TL34'
750 :     </pre>
751 :    
752 :     would display twelve features (<i>fig|31312.1.peg.1</i> through
753 :     <i>fig|31312.1.peg.12</i>, since <i>sp|Q9TL34</i> corresponds to
754 :     <i>fig|31312.1.peg.5</i>. The program outputs a tab-separated file
755 :     of the sort one might use as input to a spreadsheet like Excel. The
756 :     columns in the output contain:
757 :    
758 :     <ol>
759 :     <li>
760 :     the FIG ID,
761 :     <li>
762 :     the location of the feature,
763 :     <li>
764 :     a list of aliases for the gene,
765 :     <li>
766 :     a flag indicating whether or not the gene occurs near the end of the
767 :     contig (and, hence, is potentially truncated),
768 :     <li>
769 :     the function currently assigned to the gene,
770 :     <li>
771 :     the length of the protein sequence encoded by the gene, and
772 :     <li>
773 :     the protein sequence encoded by the gene.
774 :     </ol>
775 :     Of course, some of the features might not be PEGs, and in that case
776 :     the last three fields are just empty.
777 :     <p>
778 :     Now, let's go through some of the details in the code:
779 :     <ul>
780 :     <li>
781 :     The programs begins by trying to locate the FIG ID corresponding to
782 :     the ID given as the command-line argument. The expression
783 :     <pre>
784 :     $fig->by_alias($id)
785 :     </pre>
786 :     returns the FIG ID when it can be determined (otherwise,
787 :     <i>undef</i>).
788 :     <li>
789 :     Once the FIG ID for the PEG has been determined, the location is
790 :     requested using
791 :     <pre>
792 :     $fig->feature_location($peg)
793 :     </pre>
794 :     Remember, a location is a sequence of contiguous regions separated by
795 :     commas (often a single contiguous region); a contiguous region
796 :     amounts to a contig ID, a beginning position, and an ending position
797 :     separated by underscores. In some contexts, all you want is the
798 :     region containing a gene (without the details of intron/exon
799 :     boundaries). In this case, you can use
800 :     <pre>
801 :     ($contig,$beg,$end) = $fig->boundaries_of($loc);
802 :     </pre>
803 :     to convert the location to a single contiguous region. These will
804 :     succeed only if the sequence of contiguous regions all have the same
805 :     contig ID. <i>$beg</i> will be set to the start of the first exon,
806 :     while <i>$end</i> will be set to the end of the last exon.
807 :     <li>
808 :     Once the boundaries of the gene have been determined, the program asks
809 :     for all genes that overlap a region starting 5000 characters (bases,
810 :     if you like) to the left of the gene and ending 5000 characters to the
811 :     right of the gene. This is done by using
812 :     <pre>
813 :     $genome = &FIG::genome_of($peg);
814 :     </pre
815 :     to get the genome of the feature (<i>$fig->genome_of($peg)</i> would
816 :     work, as well), and then using
817 :     <pre>
818 :     ($features_in_region,$start_of_leftmost_feature,$end_of_rightmost_feature) =
819 :     $fig->genes_in_region($genome,$contig,$start_reg,$end_reg);
820 :     </pre>
821 :     to extract the set of genes that overlap the region specified by the
822 :     four arguments <i>$genome, $contig, $start_reg, </i>and
823 :     <i>$end_reg</i>. A list of three things is returned by the
824 :     invocation:
825 :     <ol>
826 :     <li>
827 :     <i>$features_in_region</i> will be set to a pointer to a list of the
828 :     features (represented by FIG IDs) that overlap the specified region.
829 :     <li>
830 :     <i>$start_of_leftmost_feature</i> will be set to the "leftmost"
831 :     position occupied by one of the features that is returned.
832 :    
833 :     <i>$end_of_rightmost_feature</i> will be set to the "rightmost"
834 :     position occupied by one of the features that is returned.
835 :     </ol>
836 :     Thus, if you were writing a graphics module to display the features,
837 :     you would need to show the region from
838 :     <i>$start_of_leftmost_feature</i> to <i>$end_of_rightmost_feature</i>
839 :     if you wished to capture the features without truncation.
840 :     <li>
841 :     Once the features have been acquired, the program has a loop that
842 :     simply extracts data about each feature and prints it.
843 :     <pre>
844 :     $loc1 = $fig->feature_location($fid);
845 :     $aliases1 = $fig->feature_aliases($fid);
846 :     $trunc = $fig->possibly_truncated($fid);
847 :     </pre>
848 :     shows how to get the location of a feature, a comma-separated list of
849 :     aliases, and an indication of whether or not the feature might be
850 :     truncated (this is a rather simple check to just see if the feature
851 :     occurs within a few hundred bases from the end of the contig).
852 :     The use of
853 :     <pre>
854 :     &FIG::ftype($fid)
855 :     </pre>
856 :     just extracts the type of the feature (which is almost always
857 :     <i>peg</i> or <i>rna</i>, although we certainly hope that we will have
858 :     a richer set of feature types in the not too distant future).
859 :     The two invocations
860 :     <pre>
861 :     $prot_ln = $fig->translation_length($fid)
862 :     $pseq = $fig->get_translation($fid)
863 :     </pre>
864 :     show how to get the length of the protein sequence encoded by a PEG
865 :     and the actual sequence itself.
866 :     Finally,
867 :     <pre>
868 :     $func = $fig->function_of($fid)
869 :     </pre>
870 :     is used to get the current master function assignment for the gene.
871 :     The SEED does support non-master assignments, although their use has
872 :     been discouraged. the original intent was to allow inexperienced
873 :     users to make assignments that did not overwrite the "master"
874 :     assignment. Users who identify themselves as <i>master:someone</i>
875 :     will only make assignments that overwrite the master assignment.
876 :     However, a user without the "master:" prefix can make an assignment
877 :     that is retained seprately. To access such an assignment, you would
878 :     use
879 :     <pre>
880 :     $func = $fig->function_of($fid,"someone")
881 :     </pre>
882 :     If <i>someone</i> has not made an assignment to <i>$fig</i>, this will
883 :     return the master assignment.
884 :     </ul>
885 :    
886 :     The SEED comes with a directory containing the example programs we are
887 :     displaying. If you find it hard to follow my comments or feel that I
888 :     have failed left out some detail that you need to understand, please
889 :     <ol>
890 :     <li>
891 :     contact me (Ross@TheFIG.info), and
892 :     <li>
893 :     copy the program and play with it. The most basic form of playing is
894 :     to just do things like inserting
895 :     <pre>
896 :     print &Dumper($some_returned_variable);
897 :     </pre>
898 :     to see precisely what comes back.
899 :     </ol>
900 :    
901 :     <h2>Sequence Data</h2>
902 :    
903 :     You have already been introduced to sequence data. You know how to
904 :     access the DNA from a location in a contig from a designated genome,
905 :     you know you to get locations of features, and you know how to acquire
906 :     the translations for PEGs. In this short section we discuss just a
907 :     few utilities that you might find useful.
908 :     <p>
909 :     The next short example illustrates how to access the reverse
910 :     complement of a DNA sequence (i.e., the opposite strand), how to
911 :     translate DNA sequences, and how to write out sequences in FASTA
912 :     format.
913 :     Let us look at it before thinking about additional complexities like
914 :     how to handle nonstandard translations or properly translating the
915 :     first codon:
916 :    
917 : olson 1.2 <p>
918 :     <blockquote>
919 :     <table border="1">
920 :     <tr>
921 :     <td>seq_utils.pl</td>
922 :     <td>seq_utils.py</td>
923 :     </tr>
924 :     <tr>
925 :     <td bgcolor="f0f0f0"><pre>
926 :    
927 : overbeek 1.1 use FIG;
928 :     my $fig = new FIG;
929 :    
930 : olson 1.2 # seq_utils 31312.1 NC_000927_9431_9048 > stdout 2> stderr
931 :    
932 : overbeek 1.1 $usage = "usage: seq_utils Genome Location";
933 :    
934 :     (
935 :     ($genome = shift @ARGV) &&
936 :     ($loc = shift @ARGV)
937 :     )
938 :     || die $usage;
939 :    
940 :     if ($seq = $fig->dna_seq($genome,$loc))
941 :     {
942 :     $seqR = &FIG::reverse_comp($seq);
943 :    
944 :     $tran = &FIG::translate($seq);
945 :     $tranR = &FIG::translate($seqR);
946 :    
947 :     &FIG::display_id_and_seq("plus_strand",\$seq);
948 :     &FIG::display_id_and_seq("plus_strand_translated",\$tran,\*STDERR);
949 :    
950 :     &FIG::display_id_and_seq("minus_strand",\$seqR);
951 :     &FIG::display_id_and_seq("minus_strand_translated",\$tranR,\*STDERR);
952 :     }
953 :    
954 : olson 1.2 </pre></td>
955 :     <td bgcolor="f0f0f0"><pre>
956 : golsen 1.4 from FigKernelPackages import FIG
957 :    
958 :     import sys
959 :    
960 :    
961 :    
962 :     fig = FIG.FIG()
963 :    
964 :    
965 :    
966 :     def die(msg):
967 :    
968 :     print msg
969 :    
970 :     sys.exit(0)
971 :    
972 :    
973 :    
974 :     if len(sys.argv) != 3:
975 :    
976 :     die("usage: seq_utils Genome Location")
977 :    
978 :    
979 :    
980 :     genome = sys.argv[1]
981 :    
982 :     loc = sys.argv[2]
983 :    
984 :    
985 :    
986 :     seq = fig.dna_seq(genome, loc)
987 :    
988 :    
989 :    
990 :     if seq:
991 :    
992 :     seqR = fig.reverse_comp(seq);
993 :    
994 :    
995 :    
996 :     tran = fig.translate(seq);
997 :    
998 :     tranR = fig.translate(seqR);
999 :    
1000 :    
1001 :    
1002 :     fig.display_id_and_seq("plus_strand",seq);
1003 :    
1004 :     #fig.display_id_and_seq("plus_strand_translated",\$tran,\*STDERR);
1005 :    
1006 :    
1007 :    
1008 :     fig.display_id_and_seq("minus_strand",seqR);
1009 :    
1010 :     #fig.display_id_and_seq("minus_strand_translated",\$tranR,\*STDERR);
1011 :    
1012 : olson 1.2 </pre></td>
1013 :     </tr>
1014 :     </table>
1015 :     </blockquote>
1016 :    
1017 : overbeek 1.1 <h3>Using some simple sequence utilities</h3>
1018 :     <hr>
1019 :     The program takes two command-line arguments: a
1020 :     genome ID and a location.
1021 :     The program is supposed to write out the DNA from both the plus and
1022 :     minus strands from the given location to STDOUT, and it writes the
1023 :     translations of both strands to STDERR.
1024 :     Thus,
1025 :     <pre>
1026 :     seq_utils 31312.1 NC_000927_9431_9048 > stdout 2> stderr
1027 :     </pre>
1028 :     will write the following contents to the file <i>stdout</i>:
1029 :     <pre>
1030 :     >plus_strand
1031 :     ATGTCTACTATTACAAATCAAATTATTGAACAACTAAAATCGATGACGCTTTTGGAAGCG
1032 :     GCTGAACTTGTGTCTCAAATTGAAGAGACATTTGGAGTTGATGCTTCTGCACCCGCAGCT
1033 :     GGAGCGGTGATGGTCGCTGCCGCTGCGCCGTCTGCGCCAGTTGTCGAAGAGCAAACTGAA
1034 :     TTTACTGTCATGATTAATGATGTGCCAAGTGCGAAACGAATTGCTGTTATTAAAGTTGTA
1035 :     CGTGCTTTAACAGGATTAGGGTTGAAAGAAGCCAAAGATATGATTGAGTCTGTACCAAAA
1036 :     GCAATTAAAGAAGGAATTGGGAAAGACGAAGCTCAACAAATTAAACAACAGCTCGAAGAG
1037 :     GCTGGTGCAACTGTCACAGTGAAA
1038 :     >minus_strand
1039 :     TTTCACTGTGACAGTTGCACCAGCCTCTTCGAGCTGTTGTTTAATTTGTTGAGCTTCGTC
1040 :     TTTCCCAATTCCTTCTTTAATTGCTTTTGGTACAGACTCAATCATATCTTTGGCTTCTTT
1041 :     CAACCCTAATCCTGTTAAAGCACGTACAACTTTAATAACAGCAATTCGTTTCGCACTTGG
1042 :     CACATCATTAATCATGACAGTAAATTCAGTTTGCTCTTCGACAACTGGCGCAGACGGCGC
1043 :     AGCGGCAGCGACCATCACCGCTCCAGCTGCGGGTGCAGAAGCATCAACTCCAAATGTCTC
1044 :     TTCAATTTGAGACACAAGTTCAGCCGCTTCCAAAAGCGTCATCGATTTTAGTTGTTCAAT
1045 :     AATTTGATTTGTAATAGTAGACAT
1046 :     </pre>
1047 :     and the following two sequences to the file <i>stderr</i>
1048 :     <pre>
1049 :     >plus_strand_translated
1050 :     MSTITNQIIEQLKSMTLLEAAELVSQIEETFGVDASAPAAGAVMVAAAAPSAPVVEEQTE
1051 :     FTVMINDVPSAKRIAVIKVVRALTGLGLKEAKDMIESVPKAIKEGIGKDEAQQIKQQLEE
1052 :     AGATVTVK
1053 :     >minus_strand_translated
1054 :     FHCDSCTSLFELLFNLLSFVFPNSFFNCFWYRLNHIFGFFQP*SC*STYNFNNSNSFRTW
1055 :     HIINHDSKFSLLFDNWRRRRSGSDHHRSSCGCRSINSKCLFNLRHKFSRFQKRHRF*LFN
1056 :     NLICNSRH
1057 :     </pre>
1058 :     Now let us look at the routines within the program that produced these
1059 :     results:
1060 :     <ul>
1061 :     <li>
1062 :     You have already seen the use of something like
1063 :     <pre>
1064 :     $seq = $fig->dna_seq($genome,$loc)
1065 :     </pre>
1066 :     to extract the DNA. The actual location can specify sequence from
1067 :     either strand (remember how?).
1068 :     <li>
1069 :     The we use
1070 :     <pre>
1071 :     $seqR = &FIG::reverse_comp($seq)
1072 :     </pre>
1073 :     to get the reverse complement of the sequence held in <i>$seq</i>.
1074 :     <li>
1075 :     The program uses
1076 :     <pre>
1077 :     $tran = &FIG::translate($seq);
1078 :     $tranR = &FIG::translate($seqR);
1079 :     </pre>
1080 :     to get translations of both strands. Note that these invocations use
1081 :     the standard genetic code to do the translation. The <i>translate</i>
1082 :     service supports two optional arguments. The second argument is used
1083 :     to pass a pointer to a hash containing the desired translation, if you wish to use
1084 :     a nonstandard code. The keys of the hash should be
1085 :    
1086 :     <pre>
1087 :     AAA =>
1088 :     AAC =>
1089 :     AAG =>
1090 :     .
1091 :     .
1092 :     .
1093 :     TTT =>
1094 :     </pre>
1095 :     The values are what the codons are translated to (normally, from the
1096 :     set {A,C,G,T}, but it is essentially unrestricted). If the third
1097 :     argument evaluates to <i>true</i> (i.e., defined, not null string, and
1098 :     not 0), then initial GTG and TTG codons will be translated to "M".
1099 :     <li>
1100 :     The final little utility we employ is used to write the FASTA files:
1101 :     <pre>
1102 :     &FIG::display_id_and_seq("plus_strand",\$seq);
1103 :     &FIG::display_id_and_seq("plus_strand_translated",\$tran,\*STDERR);
1104 :    
1105 :     &FIG::display_id_and_seq("minus_strand",\$seqR);
1106 :     &FIG::display_id_and_seq("minus_strand_translated",\$tranR,\*STDERR);
1107 :     </pre>
1108 :     The first argument is the <i>id</i> you wish, the second is a pointer
1109 :     to the sequence to write out, and the third argument (which is
1110 :     optional) is a file handle. It defaults to writing to STDOUT.
1111 :    
1112 :     <h2>Accessing Similarities</h2>
1113 :    
1114 :     A great deal of comparative analysis relates to use of detected
1115 :     similarities. The SEED includes precomputed similarities for each
1116 :     protein sequence. Basically these similarities are computed as
1117 :     follows:
1118 :     <ol>
1119 :     <li>
1120 :     First, a nonredundant protein sequence data base is formed by taking
1121 :     all protein sequences from a number of sources, including the
1122 :     organisms loaded into the SEED (but, including sequences from external
1123 :     sources that are not found in the SEED organisms). Sets of protein
1124 :     sequences that differ only in start calls are collapsed into single
1125 :     entries in the nonredundant database (the longer sequences are
1126 :     retained, and a table of synonyms is constructed showing which IDs
1127 :     correspond to each nonredundant entry, and which of the merged IDs
1128 :     actually represent shorter sequences.
1129 :     <li>
1130 :     Then all of the sequences in the nonredundant database are blasted
1131 :     against one another. For each sequence, there may be only a few or a
1132 :     great many similarities. We keep only the best similarity between any
1133 :     two sequences. Further, we set a maximum number of similarities
1134 :     retained for each sequence to "about 300". We try to keep a set that
1135 :     includes at least the best similarity between the given sequence and
1136 :     sequences from any specific organism. We do not guarantee that we even
1137 :     succeed in that. So, you should think of the precomputed similarities
1138 :     as a potentially useful, but quite truncated, set. We are moving
1139 :     towards a situation in which relationships between closely related sequences are saved in
1140 :     the form of trees (and the corresponding similarities are discarded).
1141 :     <li>
1142 :     When a user requests the similarities to a given PEG, first the SEED
1143 :     must figure out which entry in the nonredundant protein sequence
1144 :     collection corresponds to the PEG (it might actually be the
1145 :     translation corresponding to the PEG, or it might be another sequence
1146 :     ID corresponding to a longer sequence from, say, Swiss Prot). Then,
1147 :     the block of similarities corresponding to the representative ID is
1148 :     extracted. Each of the similarities in this set really represents a
1149 :     collection of similarities (one corresponding to each ID in the set of
1150 :     sequences that were collapsed into a single representative ID). For
1151 :     each of the "raw" similarities, you can choose to expand it into the
1152 :     entire set it represents or not, as you wish. The expansion actually
1153 :     takes much more time than actually extracting the block of
1154 :     similarities, which is an irritating fact (which may be correctable by
1155 :     some bright young computer scientists, but for now...). So, it is
1156 :     important that you understand the two concepts:
1157 :     <ul>
1158 :     <li>
1159 :     A <i>raw similarity</i> is a similarity between two members of the
1160 :     nonredundant protein collection.
1161 :     <li>
1162 :     The act of <i>expanding a similarity</i> amounts to taking a
1163 :     similarity between the given ID (ususally called <i>id1</i>) and
1164 :     another (usually called <i>id2</i>) and expanding it by considering
1165 :     all ids that were collapsed to <i>id2</i>.
1166 :     </ul>
1167 :     <li>
1168 :     Finally, you need to know that when you request similarities against a
1169 :     given id, you get to specify how many of the similarities you wish
1170 :     expanded. This is useful, since one might wish to see the expanded
1171 :     set for, say, the closest five, while getting several hundred
1172 :     unexpanded similarities.
1173 :     </ol>
1174 :     <br>
1175 :     Here is a short program that illustrates some of these comments. It
1176 :     takes as input arguments
1177 :     <ol>
1178 :     <li>
1179 :     a given PEG id,
1180 :     <li>
1181 :     a similarity threshhold (minimum p-score),
1182 :     <li>
1183 :     the maximum number of similarities to be returned, and
1184 :     <li>
1185 :     the number of similarities to expand.
1186 :     </ol>
1187 :     For each similarity, it displays the
1188 :     <ul>
1189 :     <li> the length of the given PEG,
1190 :     <li> id2 (the id of the similar sequence),
1191 :     <li> the length of id2,
1192 :     <li> the p-score,
1193 :     <li> the perecent identity,
1194 :     <li> the region of similarity, and
1195 :     <li> the function assigned to the similar sequence.
1196 :     </ul>
1197 :    
1198 : olson 1.2 <p>
1199 :     <blockquote>
1200 :     <table border="1">
1201 :     <tr>
1202 :     <td>show_similarities.pl</td>
1203 :     <td>show_similarities.py</td>
1204 :     </tr>
1205 :     <tr>
1206 :     <td bgcolor="f0f0f0"><pre>
1207 : overbeek 1.1 use FIG;
1208 :     my $fig = new FIG;
1209 :     use Sim;
1210 :    
1211 :     use strict;
1212 :    
1213 :     my($usage,$peg,$cutoff,$max,$expand);
1214 :     my(@sims,$sim,$id2,$pscore,$iden,$b1,$e1,$b2,$e2,$ln1,$ln2,$func1,$func2);
1215 :    
1216 :     $usage = "usage: show_similarities PEG CutOff MaxReturned Expand";
1217 :    
1218 :     (
1219 :     ($peg = shift @ARGV) &&
1220 :     ($cutoff = shift @ARGV) &&
1221 :     ($max = shift @ARGV) &&
1222 :     defined($expand = shift @ARGV)
1223 :     )
1224 :     || die $usage;
1225 :    
1226 :     @sims = $fig->sims($peg,$max,$cutoff,"all",$expand);
1227 :     foreach $sim (@sims)
1228 :     {
1229 :     $ln1 = $sim->ln1;
1230 :     $id2 = $sim->id2;
1231 :     $ln2 = $sim->ln2;
1232 :     $pscore = $sim->psc;
1233 :     $iden = $sim->iden;
1234 :     $b1 = $sim->b1;
1235 :     $e1 = $sim->e1;
1236 :     $b2 = $sim->b2;
1237 :     $e2 = $sim->e2;
1238 :     $func2 = $fig->function_of($id2);
1239 :     print join("\t",($ln1,$id2,$ln2,$pscore,$iden,$b1,$e1,$b2,$e2,$func2)),"\n";
1240 :     }
1241 : olson 1.2 </pre></td>
1242 :     <td bgcolor="f0f0f0"><pre>
1243 :     from FigKernelPackages import FIG
1244 :     import sys, string
1245 :    
1246 :     fig = FIG.FIG()
1247 :     id1_ix = 0
1248 :     ln1_ix = 12
1249 :     id2_ix = 1
1250 :     ln2_ix = 13
1251 :     iden_ix = 2
1252 :     psc_ix = 10
1253 :     ali_ln_ix = 3
1254 :     mismatches_ix = 4
1255 :     gaps_ix = 5
1256 :     bsc_ix = 11
1257 :     b1_ix = 6
1258 :     e1_ix = 7
1259 :     b2_ix = 8
1260 :     e2_ix = 9
1261 :     bit_score_ix = 11
1262 :     tool_ix = 14
1263 :     def2_ix = 15
1264 :     ali_ix = 16
1265 :    
1266 :    
1267 :     def die(msg):
1268 :     print msg
1269 :     sys.exit(0)
1270 :    
1271 :     if len(sys.argv) != 5:
1272 :     die("usage: show_similarities PEG CutOFF MaxReturned Expand")
1273 :    
1274 :    
1275 :     peg = sys.argv[1]
1276 :     cutoff = sys.argv[2]
1277 :     max = sys.argv[3]
1278 :     expand = sys.argv[4]
1279 :    
1280 :     sims = fig.sims(peg, max, cutoff, "all", expand)
1281 :    
1282 :     for sim in sims:
1283 :     ln1 = str(sim[ln1_ix])
1284 :     id2 = str(sim[id2_ix])
1285 :     ln2 = str(sim[ln2_ix])
1286 :     pscore = str(sim[psc_ix])
1287 :     iden = str(sim[iden_ix])
1288 :     b1 = str(sim[b1_ix])
1289 :     e1 = str(sim[e1_ix])
1290 :     b2 = str(sim[b2_ix])
1291 :     e2 = str(sim[e2_ix])
1292 :     func2 = fig.function_of(id2)
1293 :     print string.join((ln1, id2, ln2, pscore, iden, b1, e1, b2, e2, func2[1]), "\t")
1294 :     </pre></td>
1295 :     </tr>
1296 :     </table>
1297 :     </blockquote>
1298 : overbeek 1.1
1299 :     <h3>How to Access Similarities</h3>
1300 :     <hr>
1301 :     If youn were to run this program using
1302 :     <pre>
1303 :     show_similarities 'fig|83333.1.peg.3' 1.0e-5 500 0
1304 :     </pre>
1305 :     it would produce an output file that started with the following lines:
1306 :     <pre>
1307 :     310 fig|216592.1.peg.4423 310 4e-178 99.68 1 310 1 310 Homoserine kinase (EC 2.7.1.39)
1308 :     310 sp|Q8XA82 310 9e-178 99.35 1 310 1 310 Homoserine kinase (EC 2.7.1.39)
1309 :     310 pirnr|NF01089494 310 2e-177 99.35 1 310 1 310 Homoserine kinase
1310 :     310 pirnr|NF01137671 310 2e-177 99.35 1 310 1 310 Homoserine kinase (EC 2.7.1.39) (HK)
1311 :     310 fig|216599.1.peg.3713 310 3e-177 99.35 1 310 1 310 Homoserine kinase (EC 2.7.1.39)
1312 :     </pre>
1313 :     This output amounts to the first five unexpanded similarities. If you
1314 :     were to run
1315 :     <pre>
1316 :     show_similarities 'fig|83333.1.peg.3' 1.0e-5 500 2
1317 :     </pre>
1318 :     it would produce an output file that started with the following lines:
1319 :     <pre>
1320 :     310 sp|P00547 310 0 100 1 310 1 310 Homoserine kinase (EC 2.7.1.39)
1321 :     310 pirnr|NF00702903 310 0 100 1 310 1 310 Homoserine kinase (EC 2.7.1.39) (HK)
1322 :     310 kegg|eco:b0003 310 0 100 1 310 1 310 homoserine kinase (HK) [EC:2.7.1.39]
1323 :     310 kegg|ecj:JW0002 310 0 100 1 310 1 310 Homoserine kinase [EC:2.7.1.39]
1324 :     310 pirnr|NF01368612 310 0 100 1 310 -18 291 homoserine kinase
1325 :     310 gi|33383907 310 0 100 1 310 -18 291 homoserine kinase
1326 :     310 gi|16127997 310 0 100 1 310 1 310 homoserine kinase
1327 :     310 fig|216592.1.peg.4423 310 4e-178 99.68 1 310 1 310 Homoserine kinase (EC 2.7.1.39)
1328 :     310 fig|216593.1.peg.3578 310 4e-178 99.68 1 310 1 310 Homoserine kinase (EC 2.7.1.39)
1329 :     310 fig|216598.1.peg.2077 310 4e-178 99.68 1 310 1 310 Homoserine kinase (EC 2.7.1.39)
1330 :     310 gi|10186209 310 4e-178 99.68 1 310 -21 288 homoserine kinase
1331 :     310 gi|33383857 310 4e-178 99.68 1 310 -18 291 homoserine kinase
1332 :     310 pirnr|NF00706256 310 4e-178 99.68 1 310 -21 288 Homoserine kinase (Fragment)
1333 :     310 pirnr|NF01368414 310 4e-178 99.68 1 310 -18 291 homoserine kinase
1334 :     310 tr|Q9ETA5 310 4e-178 99.68 1 310 -21 288 Homoserine kinase
1335 :     310 sp|Q8XA82 310 9e-178 99.35 1 310 1 310 Homoserine kinase (EC 2.7.1.39)
1336 :     310 fig|155864.1.peg.3 310 9e-178 99.35 1 310 1 310 Homoserine kinase (EC 2.7.1.39)
1337 :     310 kegg|ece:Z0003 310 9e-178 99.35 1 310 1 310 homoserine kinase [EC:2.7.1.39]
1338 :     </pre>
1339 :     The first seven lines of this expanded set of similarities amount to
1340 :     identical matches against the set of sequences that were collapsed
1341 :     into a single entry in the nonredundant database (i.e., sequences that
1342 :     were essentially identical with <i>fig|83333.1.peg.3</i>). The next
1343 :     eight similarities correspond to the first "raw" similarity, and so
1344 :     forth.
1345 :     <br>
1346 :     Now let us examine the program:
1347 :     <ul>
1348 :     <li>Use the <i>Sim.pm</i> module, if you are processing similarities.
1349 :     That is, include
1350 :     <pre>
1351 :     use Sim;
1352 :     </pre>
1353 :     at the head of your program.
1354 :    
1355 :     <li>
1356 :     The invocation
1357 :     <pre>
1358 :     $fig->sims($peg,$max,$cutoff,"all",$expand)
1359 :     </pre>
1360 :     takes several parameters (the given PEG, the maximum number of
1361 :     similarities to return, the p-score cutoff, a "selection" option, and
1362 :     the number of raw similarities to expand).
1363 :     Itb returns a list of objects of type <i>similarity</i>.
1364 :     The "selection" option can
1365 :     be
1366 :     <ul>
1367 :     <li><i>raw</i> to get just raw similarities (equivalent to expanding
1368 :     0)
1369 :     <li><i>all</i> returns all similarities,
1370 :     <li><i>fig</i> returns only similarities against FIG ids,
1371 :     <li><i>external</i> returns only similarities against ids that are not
1372 :     FIG ids,
1373 :     </ul>
1374 :     To understand thye impact of the parameters, I suggest that you take
1375 :     the simple program and run it with altered parameters, or just use the
1376 :     SEED web services to see what happens.
1377 :    
1378 :    
1379 :     <li>
1380 :     To extract values relating to the similarity, you would use
1381 :     assignments as shown:
1382 :     <pre>
1383 :     $ln1 = $sim->ln1;
1384 :     $id2 = $sim->id2;
1385 :     $ln2 = $sim->ln2;
1386 :     $pscore = $sim->psc;
1387 :     $iden = $sim->iden;
1388 :     $b1 = $sim->b1;
1389 :     $e1 = $sim->e1;
1390 :     $b2 = $sim->b2;
1391 :     $e2 = $sim->e2;
1392 :     </pre>
1393 :     </ul>
1394 :    
1395 :     There is a great deal more that could be said about similarities, but
1396 :     for the purposes of this tutorial, that is it.
1397 :    
1398 :     <h2>Functional Coupling</h2>
1399 :    
1400 :     When I say that two genes are "functionally coupled", I mean that they
1401 :     play closely related roles. They may be multiple subunits of the same
1402 :     complex, they may be enzymes from the same pathway, they may be
1403 :     co-regulated, or whatever. Functional coupling can be inferred using
1404 :     many different technologies:
1405 :     <ul>
1406 :     <li>detection of preserved contiguity on chromosomes,
1407 :     <li>analysis of gene fusion event,
1408 :     <li>detection of similar occurrence profiles,
1409 :     <li>creation of protein-protein interaction data,
1410 :     <li>detection of common transcription regulatory sites, and
1411 :     <li>detection of common expression profiles
1412 :     </ul>
1413 :     While the SEED will eventually support all of these techniques (and,
1414 :     now does support the first three), support for detection of preserved
1415 :     contiguity is the only one that is built into the kernel.
1416 :     Hence, this discussion will focus on detection of functional coupling
1417 :     via identification of preserved contiguity of genes.
1418 :     <p>
1419 :     The SEED supports one central role that can be used to summarize
1420 :     preserved contiguity:
1421 :     The following short program illustrates its use:
1422 :    
1423 : olson 1.2 <p>
1424 :     <blockquote>
1425 :     <table border="1">
1426 :     <tr>
1427 :     <td>show_functional_coupling.pl</td>
1428 :     <td>show_functional_coupling.py</td>
1429 :     </tr>
1430 :     <tr>
1431 :     <td bgcolor="f0f0f0"><pre>
1432 : overbeek 1.1 use FIG;
1433 :     my $fig = new FIG;
1434 :     use strict;
1435 :    
1436 :     my($usage,$bound,$sim_cutoff,$fc_cutoff,@fc,$tuple,$peg);
1437 :    
1438 : olson 1.2 $usage = "usage: show_functional_coupling Bound SimCutoff FcCutOff < PEGs > PEG1-Sc-PEGs";
1439 : overbeek 1.1
1440 :     (
1441 :     ($bound = shift @ARGV) &&
1442 :     ($sim_cutoff = shift @ARGV) &&
1443 :     ($fc_cutoff = shift @ARGV)
1444 :     )
1445 :     || die $usage;
1446 :    
1447 :     while (defined($_ = <STDIN>))
1448 :     {
1449 :     if ($_ =~ /^(fig\|\d+\.\d+\.peg\.\d+)/)
1450 :     {
1451 : olson 1.2 $peg = $1;
1452 :     @fc = $fig->coupling_and_evidence($peg,$bound,$sim_cutoff,$fc_cutoff);
1453 :     foreach $tuple (@fc)
1454 :     {
1455 :     # print &Dumper($tuple);
1456 :     print "$peg\t$tuple->[0]\t$tuple->[1]\n";
1457 :     }
1458 : overbeek 1.1 }
1459 :     }
1460 : olson 1.2 </pre></td>
1461 :     <td bgcolor="f0f0f0"><pre>
1462 :     from FigKernelPackages import FIG2
1463 :     import sys, re
1464 :    
1465 :     fig = FIG2.FIG()
1466 :    
1467 :     def die(msg):
1468 :     print msg
1469 :     sys.exit(0)
1470 :    
1471 :     if len(sys.argv) != 4:
1472 :     die("usage: fc Bound SimCutoff FcCutOff < PEGs > PEG1-Sc-PEGs")
1473 :    
1474 :    
1475 :     bound = sys.argv[1]
1476 :     sim_cutoff = sys.argv[2]
1477 :     fc_cutoff = sys.argv[3]
1478 :    
1479 :     for line in sys.stdin.readlines():
1480 :     match = re.search("^(fig\|\d+\.\d+\.peg\.\d+)", line)
1481 :     if match:
1482 :     peg = match.group(1)
1483 :     fc = fig.coupling_and_evidence(peg, bound, sim_cutoff, fc_cutoff)
1484 :     for tuple in fc:
1485 :     print "%s\t%s\t%s" % ( peg, tuple[0], tuple[1])
1486 :     </pre></td>
1487 :     </tr>
1488 :     </table>
1489 :     </blockquote>
1490 : overbeek 1.1
1491 :     <h3>How to compute functional coupling scores based on preserved contiguity</h3>
1492 :     <hr>
1493 :     The short program simply reads in a file containing PEG ids. For each
1494 :     PEG, it invokes
1495 :     <pre>
1496 :     @fc =$fig->coupling_and_evidence($peg,$bound,$sim_cutoff,$fc_cutoff);
1497 :     </pre>
1498 :    
1499 :     to compute a list of 3-tuples. Each 3-tuple corresponds to a
1500 :     functionally coupled gene. The 3-tuple contains
1501 :     <ol>
1502 :     <li> the number of occurrences of preservation of homologous pairs
1503 :     that were detected in somewhat diverged genomes,
1504 :     <li> the PEG id of a functionally coupled gene, and
1505 :     <li> a pointer to a list of 2-tuples that describe the pairs in
1506 :     different genomes.
1507 :     </ol>
1508 :     If the program is run using the following command
1509 :     <pre>
1510 :     show_functional_coupling 5000 1.0e-10 2 < tmp
1511 :     </pre>
1512 :     and if <i>tmp</i> contains a single line of the form
1513 :     <pre>
1514 :     fig|83333.1.peg.3
1515 :     </pre>
1516 :     then the program will produce the following output:
1517 :     <pre>
1518 :     $VAR1 = [
1519 :     77,
1520 :     'fig|83333.1.peg.2',
1521 :     [
1522 :     [
1523 :     'fig|263.1.peg.1622',
1524 :     'fig|263.1.peg.1624'
1525 :     ],
1526 :     [
1527 :     'fig|263.1.peg.1622',
1528 :     'fig|263.1.peg.1625'
1529 :     ],
1530 :     [
1531 :     'fig|562.2.peg.2720',
1532 :     'fig|562.2.peg.2721'
1533 :     ],
1534 :     [
1535 :     'fig|573.2.peg.4598',
1536 :     'fig|573.2.peg.4594'
1537 :     ],
1538 :     .
1539 :     . <<< deleted lines showing more pairs >>>
1540 :     .
1541 :     ]
1542 :     ];
1543 :     $VAR1 = [
1544 :     44,
1545 :     'fig|83333.1.peg.4',
1546 :     [
1547 :     [
1548 :     'fig|263.1.peg.1622',
1549 :     'fig|263.1.peg.1620'
1550 :     ],
1551 :     [
1552 :     'fig|562.2.peg.2526',
1553 :     'fig|562.2.peg.2527'
1554 :     ],
1555 :     [
1556 :     'fig|562.2.peg.2521',
1557 :     'fig|562.2.peg.2522'
1558 :     ],
1559 :     [
1560 :     'fig|562.2.peg.2525',
1561 :     'fig|562.2.peg.2527'
1562 :     ],
1563 :     [
1564 :     'fig|573.2.peg.4598',
1565 :     'fig|573.2.peg.4599'
1566 :     ],
1567 :     .
1568 :     . <<< deleted lines showing more pairs >>>
1569 :     .
1570 :     ]
1571 :     ];
1572 :     $VAR1 = [
1573 :     10,
1574 :     'fig|83333.1.peg.6',
1575 :     [
1576 :     [
1577 :     'fig|562.2.peg.2526',
1578 :     'fig|562.2.peg.2529'
1579 :     ],
1580 :     [
1581 :     'fig|562.2.peg.2525',
1582 :     'fig|562.2.peg.2529'
1583 :     ],
1584 :     .
1585 :     . <<< deleted lines showing more pairs >>>
1586 :     .
1587 :     ]
1588 :     ];
1589 :     $VAR1 = [
1590 :     7,
1591 :     'fig|83333.1.peg.7',
1592 :     [
1593 :     [
1594 :     'fig|562.2.peg.2526',
1595 :     'fig|562.2.peg.2531'
1596 :     ],
1597 :     [
1598 :     'fig|562.2.peg.2526',
1599 :     'fig|562.2.peg.2530'
1600 :     ],
1601 :     [
1602 :     'fig|562.2.peg.2525',
1603 :     'fig|562.2.peg.2531'
1604 :     ],
1605 :     [
1606 :     'fig|562.2.peg.2525',
1607 :     'fig|562.2.peg.2530'
1608 :     ],
1609 :     [
1610 :     'fig|573.2.peg.4598',
1611 :     'fig|573.2.peg.4604'
1612 :     ],
1613 :     .
1614 :     . <<< deleted lines showing more pairs >>>
1615 :     .
1616 :     ]
1617 :     ];
1618 :     $VAR1 = [
1619 :     7,
1620 :     'fig|83333.1.peg.8',
1621 :     [
1622 :     [
1623 :     'fig|594.1.peg.4954',
1624 :     'fig|594.1.peg.4958'
1625 :     ],
1626 :     [
1627 :     'fig|630.2.peg.581',
1628 :     'fig|630.2.peg.584'
1629 :     ],
1630 :     [
1631 :     'fig|12149.1.peg.2',
1632 :     'fig|12149.1.peg.6'
1633 :     ],
1634 :     .
1635 :     . <<< deleted lines showing more pairs >>>
1636 :     .
1637 :     ]
1638 :     ];
1639 :     </pre>
1640 :     If you simply replaced
1641 :     <pre>
1642 :     print &Dumper($tuple);
1643 :     </pre>
1644 :     with
1645 :     <pre>
1646 :     print "$peg\t$tuple->[0]\t$tuple->[1]\n";
1647 :     </pre>
1648 :     you would get
1649 :     <pre>
1650 :     fig|83333.1.peg.3 77 fig|83333.1.peg.2
1651 :     fig|83333.1.peg.3 44 fig|83333.1.peg.4
1652 :     fig|83333.1.peg.3 10 fig|83333.1.peg.6
1653 :     fig|83333.1.peg.3 7 fig|83333.1.peg.7
1654 :     fig|83333.1.peg.3 7 fig|83333.1.peg.8
1655 :    
1656 :     </pre>
1657 :     which might be closer to what most end users wish to see.
1658 :     <p>
1659 :     The only detail that might be worth noting is the scoring function.
1660 :     It counts the number of pairs (besides the pair composed of the given
1661 :     PEG and the related PEG) that could be located, where "essentially
1662 :     identical" occurrences are collapsed to a single occurrence. For
1663 :     example, if you had 10 <i>E.coli</i> genomes (of closely related
1664 :     strains), you would wish to count them as a single occurrence.
1665 :     Basically, we collapse occurrences between genes that are encode
1666 :     protein sequences that are more than 97% identical to a single
1667 :     occurrence. So, when you see
1668 :     <pre>
1669 :     fig|83333.1.peg.3 77 fig|83333.1.peg.2
1670 :     </pre>
1671 :     This means that there are genes encoding proteins similar to those
1672 :     encoded by fig|83333.1.peg.3 and fig|83333.1.peg.2 that are within
1673 :     5000 bases of one another in 77 quite distinct genomes.
1674 :    
1675 :     <h2>Maps, Subsystems, Functional Roles, and gene Functions</h2>
1676 :    
1677 :     The SEED currently supports four distinct notions that I attempt to
1678 :     differentiate in ths section. My working definitions would be as
1679 :     follows:
1680 :     <ol>
1681 :     <li>
1682 :     A <b>map</b> is a relatively large set of related functional roles.
1683 :     The notion grew out of working with KEGG maps. It encompasses larger
1684 :     functional units than the related notion <b>pathway</b>.
1685 :     <li>The notion of <b>subsystem</b> is a slight extension of the
1686 :     concept <b>pathway</b>. It is a set of functional roles that together
1687 :     play a related function. A pathway is a metabolic subsystem. A
1688 :     ribosome is a nonmetabolic subsystem.
1689 :     <p>
1690 :     Since KEGG is doing such a useful job in attempting to curate maps, we
1691 :     download their maps and use them. We supplement them occasionally
1692 :     with some drawn by ourselves, but we do think that the community owes
1693 :     KEGG real gratitude for making these maps available.
1694 :    
1695 :     <li>My notion of <b>functional role</b> is roughly an abstraction of
1696 :     the function of a gene (i.e., a monfunctional gene). It is a
1697 :     well-defined component of a subsystem. Variants of subsystems should
1698 :     be thought of as sets of functional roles (i.e., not as sets of
1699 :     genes).
1700 :    
1701 :     <li>A <b>gene function</b> may be thought of as a set of functional
1702 :     roles. Often there is just one, but many genes are multifunctional.
1703 :     Somestimes this means that a single domain of the protein encoded by
1704 :     the gene has low specificity which allows the gene to play multiple
1705 :     roles, and sometimes the gene has several domains with each domain
1706 :     implementing a complete functional role. Either way, if a gene
1707 :     function implements multiple roles, it should have the form
1708 :     <pre>
1709 :     role1 / role2 / role3...
1710 :     </pre>
1711 :     That is, the gene function should be made up of a list of roles
1712 :     separated by " / ".
1713 :     A gene function is said to <i>connect</i> to each of the roles that
1714 :     make it up (those roles that are separated by " / "), as well as any
1715 :     embedded EC numbers. Thus, the gene function
1716 :     <pre>
1717 :     Pyruvate kinase (EC 2.7.1.40) / 6-phosphofructokinase (EC 2.7.1.11)
1718 :     </pre>
1719 :     would connect to
1720 :     <ol>
1721 :     <li>Pyruvate kinase (EC 2.7.1.40),
1722 :     <li>6-phosphofructokinase (EC 2.7.1.11),
1723 :     <li>2.7.1.40, and
1724 :     <li>2.7.1.11
1725 :     </ol>
1726 :     </ol>
1727 :     <p>
1728 :     Normally, when we speak of maps, most of the functional roles are
1729 :     simply EC numbers (although some are not). When we speak of
1730 :     subsystems, the functional roles are almost never specified as just an
1731 :     EC number.
1732 :     <p>
1733 :     Our first simple example program takes as command line arguments a map
1734 :     ID (i.e., one of the KEGG map numbers -- these can be taken off the
1735 :     initial page displayed by the SEED web services) and a genome ID. It
1736 :     displays the map ID and the "map name". Then, it lists the roles
1737 :     contained within the map, and for each role it displays the PEGs
1738 :     associated with the role. Thus,
1739 :    
1740 :     <pre>
1741 :     connect_map_to_genes MAP00010 83333.1
1742 :     </pre>
1743 :     produced the following output for me:
1744 :     <pre>
1745 :     MAP00010 Glycolysis / Gluconeogenesis
1746 :     2.7.1.11
1747 :     fig|83333.1.peg.1706 6-phosphofructokinase II (EC 2.7.1.11)
1748 :     fig|83333.1.peg.3835 6-phosphofructokinase (EC 2.7.1.11)
1749 :     2.7.1.41
1750 :     1.2.1.3
1751 :     fig|83333.1.peg.3522 Aldehyde dehydrogenase (EC 1.2.1.3)
1752 :     5.4.2.1
1753 :     fig|83333.1.peg.3548 2,3-bisphosphoglycerate-independent phosphoglycerate mutase (EC 5.4.2.1)
1754 :     fig|83333.1.peg.4303 Phosphoglycerate mutase (EC 5.4.2.1)
1755 :     fig|83333.1.peg.741 Phosphoglycerate mutase (EC 5.4.2.1)
1756 :     3.1.3.10
1757 :     fig|83333.1.peg.989 Glucose-1-phosphatase precursor (EC 3.1.3.10)
1758 :     1.1.1.27
1759 :     fig|83333.1.peg.787 L-lactate dehydrogenase (EC 1.1.1.27)
1760 :     1.8.1.4
1761 :     fig|83333.1.peg.116 Dihydrolipoamide dehydrogenase (EC 1.8.1.4)
1762 :     3.6.1.7
1763 :     fig|83333.1.peg.953 Acylphosphatase (EC 3.6.1.7)
1764 :     2.7.1.40
1765 :     fig|83333.1.peg.1660 Pyruvate kinase (EC 2.7.1.40)
1766 :     fig|83333.1.peg.1838 Pyruvate kinase (EC 2.7.1.40)
1767 :     2.7.1.1
1768 :     3.1.3.9
1769 :     4.1.1.1
1770 :     5.1.3.15
1771 :     1.1.99.8
1772 :     1.2.4.1
1773 :     fig|83333.1.peg.114 Pyruvate dehydrogenase E1 component (EC 1.2.4.1)
1774 :     5.1.3.3
1775 :     fig|83333.1.peg.742 Aldose 1-epimerase (EC 5.1.3.3)
1776 :     6.2.1.1
1777 :     fig|83333.1.peg.3979 Acetyl-coenzyme A synthetase (EC 6.2.1.1)
1778 :     5.4.2.4
1779 :     2.7.1.69
1780 :     fig|83333.1.peg.1086 PTS system, glucose-specific IIBC component (EIIBC-Glc) (Glucose- permease IIBC component) (Phosphotransferase enzyme II, BC component) (EC 2.7.1.69)
1781 :     fig|83333.1.peg.1607 PTS system, maltose and glucose-specific IIABC component (Maltose and glucose-permease IIABC component) (Phosphotransferase enzyme II, ABC component) (EC 2.7.1.69)
1782 :     fig|83333.1.peg.1719 PTS system, cellobiose-specific IIA component (EIIA-Cel) (Cellobiose- permease IIA component) (Phosphotransferase enzyme II, A component) (EC 2.7.1.69)
1783 :     fig|83333.1.peg.1721 PTS system, cellobiose-specific IIB component (EIIB-Cel) (Cellobiose- permease IIB component) (Phosphotransferase enzyme II, B component) (EC 2.7.1.69)
1784 :     fig|83333.1.peg.1800 PTS system, mannose-specific IIAB component (EC 2.7.1.69)
1785 :     fig|83333.1.peg.2068 PTS system, galactitol-specific IIB component (EIIB-GAT) (Galacticol- permease IIB component) (Phosphotransferase enzyme II, B component) (EC 2.7.1.69)
1786 :     fig|83333.1.peg.2069 PTS system, galactitol-specific IIA component (EIIA-GAT) (Galacticol- permease IIA component) (Phosphotransferase enzyme II, A component) (EC 2.7.1.69)
1787 :     fig|83333.1.peg.2142 PTS system, fructose-specific IIBC component (EIIBC-Fru) (Fructose- permease IIBC component) (Phosphotransferase enzyme II, BC component) (EC 2.7.1.69)
1788 :     fig|83333.1.peg.2144 PTS system, fructose-specific IIA/FPr component (EIIA-Fru) (Fructose- permease IIA/FPr component) (Phosphotransferase enzyme II, A/FPr component) (EC 2.7.1.69)
1789 :     fig|83333.1.peg.2165 PTS system, mannitol-specific IIBC component (EC 2.7.1.69)
1790 :     fig|83333.1.peg.2360 PTS system, fructose-specific IIBC component (EC 2.7.1.69)
1791 :     fig|83333.1.peg.2385 PTS system, glucose-specific IIA component (EC 2.7.1.69)
1792 :     fig|83333.1.peg.2657 Sorbitol PTS, EIIC (EC 2.7.1.69)
1793 :     fig|83333.1.peg.2658 PTS system, glucitol/sorbitol-specific IIBC component (EC 2.7.1.69)
1794 :     fig|83333.1.peg.2659 PTS system, glucitol/sorbitol-specific IIA component (EIIA-gut) (Glucitol/sorbitol-permease IIA component) (Phosphotransferase enzyme II, A component) (EC 2.7.1.69)
1795 :     fig|83333.1.peg.2670 PTS system, arbutin-, cellobiose-, and salicin-specific IIABC component (EIIABC-ASC) (Arbutin-, cellobiose-, and salicin-permease IIABC component) (Phosphotransferase enzyme II, ABC component) (EC 2.7.1.69)
1796 :     fig|83333.1.peg.2884 PTS system, mannitol (Cryptic) -specific IIBC component (EIIBC-(C)MTL) (Mannitol (Cryptic)-permease IIBC component) (Phosphotransferase enzyme II, BC component) (EC 2.7.1.69)
1797 :     fig|83333.1.peg.2885 PTS system, mannitol (Cryptic) -specific IIA component (EIIA-(C)MTL) (Mannitol (Cryptic)-permease IIA component) (Phosphotransferase enzyme II, A component) (EC 2.7.1.69)
1798 :     fig|83333.1.peg.3083 PTS system, N-acetylgalactosamine-specific IIB component 1 (EIIB-AGA) (N-acetylgalactosamine-permease IIB component 1) (Phosphotransferase enzyme II, B component 1) (EC 2.7.1.69)
1799 :     fig|83333.1.peg.3147 Nitrogen regulatory IIA protein (EC 2.7.1.69)
1800 :     fig|83333.1.peg.3534 PTS system, mannitol-specific IIABC component (EIIABC-MTL) (Mannitol- permease IIABC component) (Phosphotransferase enzyme II, ABC component) (EC 2.7.1.69)
1801 :     fig|83333.1.peg.3659 PTS system, beta-glucoside-specific IIABC component (EIIABC-Bgl) (Beta-glucoside-permease IIABC component) (Phosphotransferase enzyme II, ABC component) (EC 2.7.1.69)
1802 :     fig|83333.1.peg.3820 PTS system, fructose-specific IIBC component (EC 2.7.1.69)
1803 :     fig|83333.1.peg.3821 PTS system, fructose-specific IIABC component (EC 2.7.1.69)
1804 :     fig|83333.1.peg.3869 PTS system, fructose-specific IIBC component (EC 2.7.1.69)
1805 :     fig|83333.1.peg.3870 PTS system, fructose-like-2 IIB component 1 (Phosphotransferase enzyme II, B component) (EC 2.7.1.69)
1806 :     fig|83333.1.peg.3873 PTS system, fructose-like-2 IIB component 2 (Phosphotransferase enzyme II, B component) (EC 2.7.1.69)
1807 :     fig|83333.1.peg.4150 PTS system, trehalose-specific IIBC component (EIIBC-TRE) (Trehalose- permease IIBC component) (Phosphotransferase enzyme II, BC component) (EC 2.7.1.69)
1808 :     fig|83333.1.peg.4213 PTS system, galactitol-specific IIC component (EC 2.7.1.69)
1809 :     fig|83333.1.peg.669 PTS system, glucose-specific IIABC component (EC 2.7.1.69)
1810 :     fig|83333.1.peg.723 PTS system, fructose-specific IIABC component (EC 2.7.1.69)
1811 :     1.1.1.71
1812 :     1.1.1.1
1813 :     fig|83333.1.peg.1228 Aldehyde-alcohol dehydrogenase [Includes: Alcohol dehydrogenase (EC 1.1.1.1) (ADH); Acetaldehyde dehydrogenase [acetylating] (EC 1.2.1.10) (ACDH); Pyruvate-formate-lyase deactivase (PFL deactivase)]
1814 :     fig|83333.1.peg.1301 Alcohol dehydrogenase (EC 1.1.1.1)
1815 :     fig|83333.1.peg.3523 Alcohol dehydrogenase (EC 1.1.1.1)
1816 :     fig|83333.1.peg.353 Alcohol dehydrogenase class III (EC 1.1.1.1)
1817 :     2.7.1.63
1818 :     3.2.1.86
1819 :     fig|83333.1.peg.1717 6-phospho-beta-glucosidase (EC 3.2.1.86)
1820 :     fig|83333.1.peg.2853 6-phospho-beta-glucosidase bglA (EC 3.2.1.86)
1821 :     fig|83333.1.peg.3658 6-phospho-beta-glucosidase bglB (EC 3.2.1.86)
1822 :     3.1.6.3
1823 :     1.2.1.12
1824 :     fig|83333.1.peg.1762 Glyceraldehyde 3-phosphate dehydrogenase (EC 1.2.1.12)
1825 :     2.7.2.3
1826 :     fig|83333.1.peg.2877 Phosphoglycerate kinase (EC 2.7.2.3)
1827 :     5.3.1.9
1828 :     fig|83333.1.peg.3934 Glucose-6-phosphate isomerase (EC 5.3.1.9)
1829 :     1.2.1.5
1830 :     1.2.1.51
1831 :     4.2.1.11
1832 :     fig|83333.1.peg.2735 Enolase (EC 4.2.1.11)
1833 :     5.3.1.1
1834 :     fig|83333.1.peg.3838 Triosephosphate isomerase (EC 5.3.1.1)
1835 :     3.1.3.13
1836 :     4.1.2.13
1837 :     fig|83333.1.peg.1504 Fructose-bisphosphate aldolase (EC 4.1.2.13)
1838 :     fig|83333.1.peg.2072 Fructose-bisphosphate aldolase (EC 4.1.2.13)
1839 :     fig|83333.1.peg.2876 Fructose-bisphosphate aldolase (EC 4.1.2.13)
1840 :     2.7.1.2
1841 :     fig|83333.1.peg.2361 Glucokinase (EC 2.7.1.2)
1842 :     2.3.1.12
1843 :     fig|83333.1.peg.115 Dihydrolipoamide acetyltransferase component of pyruvate dehydrogenase complex (EC 2.3.1.12) (E2)
1844 :     1.1.1.2
1845 :     3.1.3.11
1846 :     fig|83333.1.peg.2881 Fructose-1,6-bisphosphatase (EC 3.1.3.11)
1847 :     fig|83333.1.peg.4142 Fructose-1,6-bisphosphatase (EC 3.1.3.11)
1848 :     5.4.2.2
1849 :     fig|83333.1.peg.678 Phosphoglucomutase (EC 5.4.2.2)
1850 :     </pre>
1851 :    
1852 :     Now let's examine the short program that produced the output:
1853 : olson 1.2
1854 :     <p>
1855 :     <blockquote>
1856 :     <table border="1">
1857 :     <tr>
1858 :     <td>connect_map_to_genes.pl</td>
1859 :     <td>connect_map_to_genes.py</td>
1860 :     </tr>
1861 :     <tr>
1862 :     <td bgcolor="f0f0f0"><pre>
1863 : overbeek 1.1 use FIG;
1864 :     my $fig = new FIG;
1865 :     use Sim;
1866 :    
1867 :     use strict;
1868 :     my($usage,$map,$genome,$name,$role,$peg,$func,$expanded_role);
1869 :    
1870 :     $usage = "usage: connect_map_to_genes Map Genome";
1871 :    
1872 :     (
1873 :     ($map = shift @ARGV) &&
1874 :     ($genome = shift @ARGV)
1875 :     )
1876 :     || die $usage;
1877 :    
1878 :     $name = $fig->map_name($map);
1879 :     print "$map\t$name\n";
1880 : olson 1.2 foreach $role ($fig->map_to_ecs($map))
1881 : overbeek 1.1 {
1882 :     print " $role\n";
1883 :     foreach $peg ($fig->seqs_with_role($role,"master",$genome))
1884 :     {
1885 : olson 1.2 $func = $fig->function_of($peg);
1886 :     print " $peg\t$func\n";
1887 : overbeek 1.1 }
1888 :     }
1889 : olson 1.2 </pre></td>
1890 :     <td bgcolor="f0f0f0"><pre>
1891 :     from FigKernelPackages import FIG2
1892 :     import sys
1893 :    
1894 :     fig = FIG2.FIG()
1895 :    
1896 :     def die(msg):
1897 :     print msg
1898 :     sys.exit(0)
1899 :    
1900 :     if len(sys.argv) != 3:
1901 :     die("usage: connect_map_to_genes Map Genome")
1902 :    
1903 :     map = sys.argv[1]
1904 :     genome = sys.argv[2]
1905 :     name = fig.map_name(map);
1906 :    
1907 :     print "%s\t%s" % (map, name)
1908 :     roles = fig.map_to_ecs(map)
1909 :     for role in roles:
1910 :     print role
1911 :     pegs = fig.seqs_with_role(role,"master",genome)
1912 :     for peg in pegs:
1913 :     func = fig.function_of(peg);
1914 :     print " %s\t%s" % (peg, func[0][1])
1915 :     </pre></td>
1916 :     </tr>
1917 :     </table>
1918 :     </blockquote>
1919 : overbeek 1.1
1920 :     returns the full name associated with the map.
1921 :     <li>You can get the the list of functional roles contained within a map using
1922 :     <pre>
1923 :     $fig->map_to_roles($map)
1924 :     </pre>
1925 :     <li>Because the roles within a map are often just EC numbers, you will
1926 :     usually wish to "expand" them by attaching a common name for the
1927 :     function represented by the EC number. You can do this using
1928 :     <pre>
1929 :     $expanded_role = $fig->expand_ec($role)
1930 :     </pre>
1931 :     I did not do that in this example, since I wanted you to see the EC
1932 :     numbers being used as functional roles in maps. You should modify the
1933 :     program to expand the ECs and rerun it.
1934 :     <li>
1935 :     Finally, you can get a list of the PEGs within a genome that have gene functions
1936 :     which connect to a specific functional role using
1937 :     <pre>
1938 :     $fig->seqs_with_role($role,"master",$genome)
1939 :     </pre>
1940 :     </ul>
1941 :     Since maps and subsystems are remarkably similar notions (both are
1942 :     sets of functional roles), you could also construct a similar program
1943 :     for subsystems. In the SEED, subsystems do not have IDs which are
1944 :     associated with full names; rather, the full name is itself the ID of
1945 :     the subsystem. With this in mind, the following program should be
1946 :     reasonably clear:
1947 :    
1948 : olson 1.2 <p>
1949 :     <blockquote>
1950 :     <table border="1">
1951 :     <tr>
1952 :     <td>connect_subsystem_to_genes.pl</td>
1953 :     <td>connect_subsystem_to_genes.py</td>
1954 :     </tr>
1955 :     <tr>
1956 :     <td bgcolor="f0f0f0"><pre>
1957 :    
1958 : overbeek 1.1 use FIG;
1959 :     my $fig = new FIG;
1960 :     use Sim;
1961 :    
1962 :     use strict;
1963 :     my($usage,$subsystem,$genome,$name,$role,$peg,$func);
1964 :    
1965 :     $usage = "usage: connect_subsystem_to_genes Subsystem Genome";
1966 :    
1967 :     (
1968 :     ($subsystem = shift @ARGV) &&
1969 :     ($genome = shift @ARGV)
1970 :     )
1971 :     || die $usage;
1972 :    
1973 :     print "$subsystem\n";
1974 :     foreach $role ($fig->subsystem_to_roles($subsystem))
1975 :     {
1976 :     print " $role\n";
1977 :     foreach $peg ($fig->seqs_with_role($role,"master",$genome))
1978 :     {
1979 : olson 1.2 $func = $fig->function_of($peg);
1980 :     print " $peg\t$func\n";
1981 : overbeek 1.1 }
1982 :     }
1983 : olson 1.2 </pre></td>
1984 :     <td bgcolor="f0f0f0"><pre>
1985 :     from FigKernelPackages import FIG2
1986 :     import sys
1987 :    
1988 :     fig = FIG2.FIG()
1989 :    
1990 :     def die(msg):
1991 :     print msg
1992 :     sys.exit(0)
1993 :    
1994 :     if len(sys.argv) != 3:
1995 :     die("usage: connect_subsystem_to_genes Subsystem Genome")
1996 :    
1997 :     subsystem = sys.argv[1]
1998 :     genome = sys.argv[2]
1999 :    
2000 :     print subsystem
2001 :     roles = fig.subsystem_to_roles(subsystem)
2002 :     for role in roles:
2003 :     print role
2004 :     pegs = fig.seqs_with_role(role,"master",genome)
2005 :     for peg in pegs:
2006 :     func = fig.function_of(peg);
2007 :     print "%s\t%s" % ( peg, func[0][1])
2008 :     </pre></td>
2009 :     </tr>
2010 :     </table>
2011 :     </blockquote>
2012 : overbeek 1.1
2013 :     <h3>Locating genes within an organism that connect to a specific subsystem</h3>
2014 :     <hr>
2015 :     Invoking this program using
2016 :     <pre>
2017 :     connect_subsystem_to_genes 'Embden-Meyerhof and Gluconogenesis' 83333.1
2018 :     </pre>
2019 :     would produce the following output:
2020 :     <pre>
2021 :     </pre>
2022 :     Note that the functional roles in subsystems tend to be far more
2023 :     detailed (and almost never just an EC number).
2024 :     <p>
2025 :     The problem with the above program is that we have chosen to view a
2026 :     subsystem as a spreadsheet in which each cell corresponds to a
2027 :     specific functional role in a designated organism. The gene functions
2028 :     of the PEGs contained in the cell usually connect to the specific
2029 :     role, but this is not forced. If someone changes the function of a
2030 :     gene, it does not disappear from the spreadsheet, unless the
2031 :     spreadsheet is rebuilt by forcing the entries to reflect current
2032 :     assignments and how they connect to the roles. We have chosen this
2033 :     approach to allow curators of spreadsheets to have precise control
2034 :     over the contents. People who import a subsystem may or may not
2035 :     instal the gene assignments needed to make the spreadsheet
2036 :     consistent, but the spreadsheet reflects the curators best assessment
2037 :     of which genes implement specific roles. This leads to my assertion
2038 :     that the following program is a better way to display a subsystem:
2039 :    
2040 : olson 1.2 <p>
2041 :     <blockquote>
2042 :     <table border="1">
2043 :     <tr>
2044 :     <td>connect_subsystem_to_genes_the_right_way.pl</td>
2045 :     <td>connect_subsystem_to_genes_the_right_way.py</td>
2046 :     </tr>
2047 :     <tr>
2048 :     <td bgcolor="f0f0f0"><pre>
2049 :    
2050 : overbeek 1.1 use FIG;
2051 :     my $fig = new FIG;
2052 :     use Sim;
2053 :    
2054 :     use strict;
2055 :     my($usage,$subsystem,$genome,$name,$role,$peg,$func);
2056 :    
2057 :     $usage = "usage: connect_subsystem_to_genes_the_right_way Subsystem Genome";
2058 :    
2059 :     (
2060 : olson 1.2 ($subsystem = shift @ARGV) &&
2061 :     ($genome = shift @ARGV)
2062 :     )
2063 : overbeek 1.1 || die $usage;
2064 :    
2065 :     print "$subsystem\n";
2066 :     foreach $role ($fig->subsystem_to_roles($subsystem))
2067 :     {
2068 :     print " $role\n";
2069 : olson 1.2 foreach $peg ($fig->pegs_in_subsystem_cell($subsystem, $genome, $role))
2070 : overbeek 1.1 {
2071 : olson 1.2 $func = $fig->function_of($peg);
2072 :     print " $peg\t$func\n";
2073 : overbeek 1.1 }
2074 :     }
2075 : olson 1.2 </pre></td>
2076 :     <td bgcolor="f0f0f0"><pre>
2077 :     from FigKernelPackages import FIG2
2078 :     import sys
2079 :    
2080 :     fig = FIG2.FIG()
2081 :    
2082 :     def die(msg):
2083 :     print msg
2084 :     sys.exit(0)
2085 :    
2086 :     if len(sys.argv) != 3:
2087 :     die("usage: connect_subsystem_to_genes_the_right_way Subsystem Genome")
2088 :    
2089 :     subsystem = sys.argv[1]
2090 :     genome = sys.argv[2]
2091 :    
2092 :     print subsystem
2093 :     roles = fig.subsystem_to_roles(subsystem)
2094 :    
2095 :     for role in roles:
2096 :     print role
2097 :     pegs = fig.pegs_in_subsystem_cell(subsystem, genome, role)
2098 :     for peg in pegs:
2099 :     func = fig.function_of(peg)
2100 :     print " %s\t%s" % (peg, func[0][1])
2101 :     </pre></td>
2102 :     </tr>
2103 :     </table>
2104 :     </blockquote>
2105 : overbeek 1.1
2106 :     <h3>Locating genes within an organism that connect to a specific
2107 :     subsystem the right way</h3>
2108 :     <hr>
2109 :    
2110 :     <p>
2111 :     We have just illustrated how to connect maps to functional roles and
2112 :     functional roles to gene functions. Now we need to go the other way.
2113 :     That is, given a gene function, what roles would it
2114 :     connect to, and which maps and subsystems contain those roles?
2115 :     The following program illustrates the services required to answer this
2116 :     question;
2117 :    
2118 : olson 1.2 <p>
2119 :     <blockquote>
2120 :     <table border="1">
2121 :     <tr>
2122 :     <td>gf2maps_and_subs.pl</td>
2123 :     <td>gf2maps_and_subs.py</td>
2124 :     </tr>
2125 :     <tr>
2126 :     <td bgcolor="f0f0f0"><pre>
2127 : overbeek 1.1 use FIG;
2128 :     my $fig = new FIG;
2129 :    
2130 :     use strict;
2131 :     my($usage,$peg,$func,@roles,$role,@maps,$map);
2132 :     my(%in_map,@maps_it_connects_to,@subsystems_it_is_in);
2133 :     my($map_name,$subsystem);
2134 :    
2135 :     $usage = "usage: gf2maps_and_subs PEG";
2136 :    
2137 :     # The single input argument must be a FIG id, which is used to get the assigned function.
2138 :     # The program then prints a list of maps to which the gene can be connected,
2139 :     # followed by a list of subsystems that contain the gene.
2140 :    
2141 :     ($peg = shift @ARGV)
2142 :     || die $usage;
2143 :    
2144 :     if (($func = $fig->function_of($peg)) && (! &FIG::hypo($func)))
2145 :     {
2146 :     @roles = $fig->roles_of_function($func);
2147 :     foreach $role (@roles)
2148 :     {
2149 : olson 1.2 @maps = $fig->role_to_maps($role);
2150 :     foreach $map (@maps)
2151 :     {
2152 :     $in_map{$map} = 1;
2153 :     }
2154 : overbeek 1.1 }
2155 :    
2156 : olson 1.2 @maps_it_connects_to = sort keys(%in_map);
2157 : overbeek 1.1
2158 :     if (@maps_it_connects_to > 0)
2159 :     {
2160 : olson 1.2 print "Maps that connect to $func\n\n";
2161 :     foreach $map (@maps_it_connects_to)
2162 :     {
2163 :     $map_name = $fig->map_name($map);
2164 :     print "$map\t$map_name\n";
2165 :     }
2166 :     print "\n";
2167 : overbeek 1.1 }
2168 :    
2169 :     }
2170 :     else
2171 :     {
2172 :     print STDERR "you need to give a valid PEG with a non-hypothetical function to connect to maps\n";
2173 :     }
2174 :    
2175 :     @subsystems_it_is_in = $fig->peg_to_subsystems($peg);
2176 :    
2177 :     if (@subsystems_it_is_in > 0)
2178 :     {
2179 :     print "Subsystems that contain $peg\n\n";
2180 :     foreach $subsystem (@subsystems_it_is_in)
2181 :     {
2182 : olson 1.2 print "$subsystem\n";
2183 : overbeek 1.1 }
2184 :     }
2185 : olson 1.2 </pre></td>
2186 :     <td bgcolor="f0f0f0"><pre>
2187 :     from FigKernelPackages import FIG2
2188 :     import sys
2189 :    
2190 :     fig = FIG2.FIG()
2191 :    
2192 :     def die(msg):
2193 :     print msg
2194 :     sys.exit(0)
2195 :    
2196 :     if len(sys.argv) != 2:
2197 :     die("usage: gf2maps_and_subs PEG")
2198 :    
2199 :     # The single input argument must be a FIG id, which is used to get the assigned function.
2200 :     # The program then prints a list of maps to which the gene can be connected,
2201 :     # followed by a list of subsystems that contain the gene.
2202 :    
2203 :     peg = sys.argv[1]
2204 :    
2205 :     func = fig.function_of(peg)
2206 :     func = func[0][1]
2207 :     hypo = fig.hypo(func)
2208 :     in_map = {}
2209 :    
2210 :     if func and hypo[0] == 0:
2211 :     roles = fig.roles_of_function(func)
2212 :     for role in roles:
2213 :     maps = fig.role_to_maps(role)
2214 :     for map in maps:
2215 :     in_map[map] = 1
2216 :    
2217 :     maps_it_connects_to = in_map.keys()
2218 :     maps_it_connects_to.sort()
2219 :    
2220 :     if len(maps_it_connects_to) > 0:
2221 :     print "Maps that connect to %s\n" % func
2222 :     for map in maps_it_connects_to:
2223 :     map_name = fig.map_name(map)
2224 :     print "%s\t%s" % (map, map_name[0])
2225 :     print " ";
2226 :     else:
2227 :     sys.stderr.write("you need to give a valid PEG with a non-hypothetical function to connect to maps\n")
2228 :    
2229 :     subsystems_it_is_in = fig.peg_to_subsystems(peg)
2230 :    
2231 :     if len(subsystems_it_is_in) > 0:
2232 :     print "Subsystems that contain %s\n" % peg
2233 :     for subsystem in subsystems_it_is_in:
2234 :     print subsystem
2235 :     </pre></td>
2236 :     </tr>
2237 :     </table>
2238 :     </blockquote>
2239 :    
2240 : overbeek 1.1 <h3>Going from a PEG to the maps and subsystems containing it</h3>
2241 :     <hr>
2242 :     Here, again, there are some of points worth discussing:
2243 :     <ul>
2244 :     <li> First, note that the input argument is a PEG, which we use to get
2245 :     the function that we will work with. You learned how to get the
2246 :     function in a previous example, but note the use of
2247 :     <pre>
2248 :     &FIG::hypo($func)
2249 :     </pre>
2250 :     to check to see whether or not the function is equivalent to
2251 :     <i>hypothetical protein</i>.
2252 :     <p>
2253 :     The way we explore connections between the PEG and maps is quite
2254 :     different than the way we connect the PEG to subsystems. In the case
2255 :     of maps, the connection is constructed by computing the set of roles
2256 :     to which the gene function connects, and then connecting each role to
2257 :     maps. In the case of subsystems, we simply look up which subsystems
2258 :     contain the given PEG. Subsystems explicitly contain a precise list
2259 :     of the PEGs that occur within them, which is not true of maps.
2260 :    
2261 :     <li>The key services for getting at the connections are as follows:
2262 :     <pre>
2263 :     @roles = $fig->roles_of_function($func);
2264 :     @maps = $fig->role_to_maps($role)
2265 :     @subsystems = $fig->peg_to_subsystems($peg);
2266 :     </pre>
2267 :     The first takes a function as input and returns a list of roles that
2268 :     the function connects to, the second takes a role and returns a list
2269 :     of maps to which the role connects, and the last takes a peg and
2270 :     returns the list of subsystems that contain the PEG.
2271 :     </ul>
2272 :     If you take a few minutes out and think about maps and subsystems, you
2273 :     will see that they are very, very similar concepts that are treated
2274 :     quite differently by the SEED.
2275 :    
2276 :     <h2>Annotations</h2>
2277 :    
2278 :     Annotations are time-stamped text notes made by a named user and
2279 :     attached to specific features. The most common are notes recording
2280 :     when someone made an assignment of function to a PEG, but they can be
2281 :     used to record anything. The usual way to access the annotations
2282 :     attached to a given feature with an ID designated by <i>$fid</i> is to
2283 :     use
2284 :     <pre>
2285 :     @annotations = $fig->feature_annotations($fid)
2286 :     </pre>
2287 :     The returned list contains 4-tuples (i.e., pointers to lists, each of
2288 :     which contains 4 entries). Each 4-tuple contains
2289 :     <ol>
2290 :     <li> the FID (this is redundant, but it was convenient for reasons
2291 :     that will become apparent),
2292 :     <li>a timestamp,
2293 :     <li>the name of the user that made the annotation, and
2294 :     <li>the annotation in the form of a text string.
2295 :     </ol>
2296 :     The following short program uses
2297 :     <pre>
2298 :     @annotations = $fig->merged_related_annotations($fig_ids)
2299 :     </pre>
2300 :     which is a useful generalization. This routine takes a pointer to a list of
2301 :     feature IDs as input, and it produces a merged list of the annotations
2302 :     attached to all of the features. This can often be very convenient.
2303 :     For example, you might have a set of similar PEGs that are all
2304 :     misannotated, and you wish to reconstruct the steps that led to the
2305 :     current situation. By merging the annotations (which normally record
2306 :     both who made the annotations and why), you can get a record of the
2307 :     events in chronilogical order. The following short example
2308 :     illustrates this utility:
2309 :    
2310 : olson 1.2 <p>
2311 :     <blockquote>
2312 :     <table border="1">
2313 :     <tr>
2314 :     <td>annotations_of.pl</td>
2315 :     <td>annotations_of.py</td>
2316 :     </tr>
2317 :     <tr>
2318 :     <td bgcolor="f0f0f0"><pre>
2319 : overbeek 1.1 use FIG;
2320 :     my $fig = new FIG;
2321 :     use strict;
2322 :    
2323 :     my($usage,$fig_ids,@annotations);
2324 :    
2325 :     $usage = "usage: annotations_of Fid1 Fid2 ...";
2326 :    
2327 :     (@ARGV > 0)
2328 :     || die $usage;
2329 :    
2330 :     $fig_ids = [@ARGV];
2331 :    
2332 :     @annotations = $fig->merged_related_annotations($fig_ids);
2333 :     if (@annotations > 0)
2334 :     {
2335 :     &display_annotations(\@annotations);
2336 :     }
2337 :     else
2338 :     {
2339 :     print STDERR "Sorry, there were no annotations attached to the features\n";
2340 :     }
2341 :    
2342 :     sub display_annotations {
2343 :     my($annotations) = @_;
2344 :     my($annotation,$fid,$timestamp,$user,$text);
2345 :    
2346 :     foreach $annotation (@$annotations)
2347 :     {
2348 : olson 1.2 ($fid,$timestamp,$user,$text) = @$annotation;
2349 :     print "=======\n";
2350 :     print "$user\t$timestamp\t$fid\n$text\n";
2351 :     print "=======\n\n";
2352 : overbeek 1.1 }
2353 :     }
2354 : olson 1.2 </pre></td>
2355 :     <td bgcolor="f0f0f0"><pre>
2356 :     from FigKernelPackages import FIG
2357 :     import sys
2358 :    
2359 :     def display_annotations(annotations):
2360 :     for annotation in annotations:
2361 :     fid,timestamp,user,text = annotation
2362 :     print "=======\n";
2363 :     print "%s\t%s\t%s\n%s\n" % ( user, timestamp, fid, text)
2364 :     print "=======\n\n"
2365 :    
2366 :     fig = FIG.FIG()
2367 :    
2368 :     def die(msg):
2369 :     print msg
2370 :     sys.exit(0)
2371 :    
2372 :     if len(sys.argv) < 2:
2373 :     die("usage: annotations_of Fid1, Fid2, ... ")
2374 :    
2375 :     fig_ids = sys.argv[1:]
2376 :    
2377 :     annotations = fig.merged_related_annotations(fig_ids)
2378 :     if len(annotations) > 0:
2379 :     display_annotations(annotations)
2380 :     else:
2381 :     sys.stderr.write("Sorry, there were no annotations attached to the features\n")
2382 :    
2383 :     </pre></td>
2384 :     </tr>
2385 :     </table>
2386 :     </blockquote>
2387 :    
2388 : overbeek 1.1
2389 :     <h3>Accessing the annotations from a set of features</h3>
2390 :     <hr>
2391 :     Invoking this little utility using
2392 :     <pre>
2393 :     annotations_of 'fig|207559.1.peg.735' 'fig|12149.1.peg.2133' 'fig|1525.1.peg.1757'
2394 :     </pre>
2395 :     produced the following output:
2396 :     <pre>
2397 :     =======
2398 :     automated_assignment Thu Mar 18 17:42:51 2004 fig|12149.1.peg.2133
2399 :     Set master function to
2400 :     ATP phosphoribosyltransferase (EC 2.4.2.1
2401 :     =======
2402 :    
2403 :     =======
2404 :     automated_assignment Thu Mar 18 17:45:16 2004 fig|1525.1.peg.1757
2405 :     Set master function to
2406 :     ATP phosphoribosyltransferase (EC 2.4.2.1
2407 :     =======
2408 :    
2409 :     =======
2410 :     last_release Fri Apr 2 14:14:07 2004 fig|207559.1.peg.735
2411 :     Set master function to
2412 :     ATP phosphoribosyltransferase (EC 2.4.2.1
2413 :     =======
2414 :     </pre>
2415 :    
2416 :     <h2>Assigning Gene Functions</h2>
2417 :    
2418 :     We can now easily talk about assigning functions and making
2419 :     annotations to record the events. Here is a simple utility
2420 :     that can be used to make annotations:
2421 :    
2422 : olson 1.2 <p>
2423 :     <blockquote>
2424 :     <table border="1">
2425 :     <tr>
2426 :     <td>batch_assignments.pl</td>
2427 :     </tr>
2428 :     <tr>
2429 :     <td bgcolor="f0f0f0"><pre>
2430 : overbeek 1.1 use FIG;
2431 :     my $fig = new FIG;
2432 :    
2433 :     use strict;
2434 :     my($usage,$user,$annotation,$text,$peg,$func,$conf,$user_name,$funcO);
2435 :    
2436 :     $usage = "usage: batch_assignments User [AnnotationFile] < Assignments";
2437 :    
2438 :     ($user = shift @ARGV)
2439 :     || die $usage;
2440 :    
2441 :     if ($annotation = shift @ARGV)
2442 :     {
2443 :     $text = join("",`cat $annotation`);
2444 :     }
2445 :     else
2446 :     {
2447 :     $text = "";
2448 :     }
2449 :    
2450 :     while (defined($_ = <STDIN>))
2451 :     {
2452 :     chop;
2453 :     ($peg,$func,$conf) = split(/\t/,$_);
2454 :     if (! $conf) { $conf = "" }
2455 : golsen 1.4
2456 :     $funcO = $fig->function_of($peg);
2457 :     if ($funcO ne $func)
2458 : overbeek 1.1 {
2459 : golsen 1.4 # annotations are now made by assign_function
2460 :     $fig->assign_function($peg,$user,$func,$conf);
2461 : overbeek 1.1 }
2462 :     }
2463 :    
2464 : olson 1.2 </pre></td>
2465 :     </tr>
2466 :     </table>
2467 :     </blockquote>
2468 : overbeek 1.1
2469 :     <h3>A utility to make batch assignments</h3>
2470 :     <hr>
2471 :     Before discussing the details of this code, we should discuss the
2472 :     issue of "master and non-master users". Each gene can have a single
2473 :     <i>master assignment</i> and any number of <i>non-master
2474 :     assignments</i>. A user who has signed in as <i>master:USER</i>
2475 :     (e.g., "master:RossO") is a master user. If a master user makes an
2476 :     assignment, it overwrites the master assignment. If a non-master user
2477 :     makes an assignment, it alters only the assignment corresponding to
2478 :     that user, if there is one. When a master asks for the function of a
2479 :     gene, he gets the master assignment. If a non-master asks for the
2480 :     function of a gene, he will get the assignment corresponding to his
2481 :     user id, if there is one; if not, he will get the master assignment,
2482 :     if there is one.
2483 :     <p>
2484 :     There is also the matter of <i>confidence levels</i>. The SEED
2485 :     supports attaching confidence codes to assignments, although they are
2486 :     seldom used. The codes we use at this point are:
2487 :     <ul>
2488 :     <li> an empty string, a space or <b>N</B> means "normal or average confidence",
2489 :     <li> an <b>H</b> or an <b>S</B> means "high or strong confidence", and
2490 :     <li>an <b>L</b> or a <b>W</b> means "low or weak confidence".
2491 :     </ol>
2492 :     We are not really employing the codes actively. The major source of
2493 :     strength in an assignment is being part of a subsystem or a
2494 :     well-characterized class of proteins. This will clarify substantially
2495 :     as the Project to Annotate 1000 Genomes progresses.
2496 :     <p>
2497 :     Now, back to the example program:
2498 :     <ul>
2499 :     <li>
2500 :     The program is written to allow the
2501 :     user to optionally attach a block of annotation to each of the genes
2502 :     that get reassigned. If the optional argument (the second argument)
2503 :     is prtesent, it is presumed to be the name of a file containing text
2504 :     to be appended to the annotation the program makes to record the
2505 :     change of assignment. The line of code
2506 :     <pre>
2507 :     $text = join("",`cat $annotation`);
2508 :     </pre>
2509 :     just reads the entire file, concatenates the lines, and assigns the
2510 :     result to the variable <i>$text</i>.
2511 :     <li>
2512 :     We default the <i>confidence level</i> to the empty string.
2513 :     <li> We only make asssignments if the current function differs from
2514 :     the one to be assigned. <i>$funcO</i> is used to hold the existing
2515 : golsen 1.4 asssignment.
2516 : overbeek 1.1 <li>
2517 :     The
2518 :     <pre>
2519 :     $fig->assign_function($peg,$user,$func,$conf);
2520 :     </pre>
2521 : golsen 1.4 is used to actually make the assignment in the database, record it in in the
2522 :     current assignments file, and record an annotation of the annotation history
2523 :     file.
2524 : overbeek 1.1 </ul>
2525 :    
2526 :     <h2>BBHs</h2>
2527 :    
2528 :     Now, let us return briefly to the topic of similarities. The notion
2529 :     of <i>bi-directional best hit</i> (<b>BBH</b>) often pops up in
2530 :     discussions relating to similarities and how to project assignments
2531 :     from a gene in one genome to a gene in another. A gene <i>X</i> in
2532 :     genome <i>Gx</i> is a BBH of a gene <i>Y</i> in <i>Gy</i> iff
2533 :     <ol>
2534 :     <li><i>X</i> is similar to <i>Y</i>,
2535 : redwards 1.3 there is no other gene in <i>Gy</i> more similar to <i>X</i>, and
2536 :     there is no other gene in <i>Gx</i> more similar to <i>Y</i>.
2537 : overbeek 1.1 </ol>
2538 :     To get the BBHs for a given PEG, you would use the something like the
2539 :     following:
2540 :     <pre>
2541 :     @bbhs = $fig->bbhs($peg,1.0e-10);
2542 :     </pre>
2543 :     This would set <i>@bbhs</i> to a list of 2-tuples. Each 2-tuple would
2544 :     contain a PEG and a p-score. They would correspond to BBHs which have
2545 :     similarity better than 1.0e-10.
2546 :    
2547 :     <h2>The Use of dsims</h2>
2548 :    
2549 :     We have constructed a means of computing <i>dynamic similarities</i>
2550 :     fairly rapidly. The basic idea involves breaking the nonredundant
2551 :     protein database into many thousands of subsets, where all of the
2552 :     sequences in a subset are quite similar (sequences that do not fall
2553 :     into subsets get placed into a <i>nohits database</i>). Then, one or more
2554 :     representative sequences from each subset are included in a database
2555 :     of <i>exemplars</i>.
2556 :     <p>
2557 :     To get <i>blastp</i> similarities quickly (and somewhat less
2558 :     accurately than just blasting against the nonredundant database), use
2559 :     something like
2560 :     <pre>
2561 :     @sims = $fig->dsims($id,$seq,200,1.0e-10,"raw")
2562 :     </pre>
2563 :     This would return the similarities that can rapidly be detected
2564 :     between the protein sequence with id given by <i>id</i> and sequence
2565 :     given by <i>$seq</i>. It would give you at most 200 similarities
2566 :     with p-scores better than 1.0e-10, and you would get only the
2567 :     <i>raw</i> similarities (<i>all</i> would return the expanded
2568 :     similarities).
2569 :    
2570 :     <h1> PAUSE: I am pausing here; I hope to finish this tutorial within a few months</h1>
2571 :    
2572 :     <h2>Clusters, Pins, and Largest Clusters</h2>
2573 :    
2574 :     <h2>The Indexes</h2>
2575 :    
2576 :     <h2>Administrative Issues</h2>
2577 :    
2578 :     <h2>Atomated Assignment of Function</h2>
2579 :    
2580 :     <h2>Finding Candidates for a Functional Role</h2>
2581 :    
2582 :     <h2>Protein Families</h2>
2583 :    
2584 :     <h2>Compounds and Reactions</h2>
2585 :    
2586 :     <h2>Subsystems</h2>
2587 :    
2588 :     <h2>Links to PEGs</h2>
2589 :    
2590 :     <h2>Miscellaneous Stuff that You Should Know</h2>

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3