[Bio] / FigTutorial / API.html Repository:
ViewVC logotype

Annotation of /FigTutorial/API.html

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1.3 - (view) (download) (as text)

1 : overbeek 1.1 <h1>Accessing SEED Services</h1>
2 :     <h2>Introduction</h2>
3 :     This document is designed to help a programmer understand how to
4 :     access the services provided by the SEED. It is not designed for the
5 :     user who wishes to confine his experience to working with the SEED
6 :     browser over the Web.
7 :     <p>
8 :     Before we launch into a discussion of the details of how to invoke
9 :     specific routines, there are some background issues that you should
10 :     know about. First, let's talk about <b>fig</b>, which is a
11 :     command-line utility that is used for a number of purposes. Its most
12 :     basic purpose is to exercise the API and illustrate what data is
13 :     accessed by each operation. However, you may also find it extremely
14 :     useful as a quick way to invoke services. To try it out, type
15 :     <pre>
16 :     fig
17 :     </pre>
18 :     This should produce a prompt of the form
19 :     <pre>
20 :     ??
21 :     </pre>
22 :     You should type
23 :     <pre>
24 :     ?? h&lt;cr&gt;
25 :     </pre>
26 :     to get a listing of the commands
27 :     supported by the utility.
28 :     At the time this document was written (May, 2004), the output was as follows:
29 :     <pre>
30 :     <b>?? h</b>
31 :     DB Current DB
32 :     abbrev genome_name
33 :     access_sims [expand|func|expandfunc] (for timeing - access 10 distinct sims)
34 :     add_annotation FID User [ prompted for annotation; terminate with "." at start of line ]
35 :     add_chromosomal_clusters File
36 :     add_genome GenomeDir
37 :     add_pch_pins File
38 :     add_peg_link PEG Link (e.g., &lt;a href=http//...&gt;link to somewhere&lt;/a&gt;)
39 :     all_compounds
40 :     all_features GenomeID Type
41 :     all_maps
42 :     all_protein_families
43 :     all_reactions
44 :     all_roles
45 :     all_sets Relation SetName
46 :     assign_function PEG User [conf=X] Function
47 :     assign_functionF User File [no_annotations]
48 :     assignments_made who date G1 G2 ... [ none => all ]
49 :     auto_assign PEG [Seq]
50 :     auto_assignF User FileOfPEGs
51 :     auto_assignG User GenomeID1 GenomeID2 ...
52 :     bbhs PEG Cutoff
53 :     between n1 n2 n3
54 :     blast PEG [against nr]
55 :     blastit PEG seq db maxP
56 :     boundaries_of Loc
57 :     build_tree_of_complete min_for_label
58 :     by_alias alias
59 :     by_fig_id FID1 FID2
60 :     candidates_for_role Genome Cutoff Role
61 :     cas CID
62 :     cas_to_cid cas
63 :     catalyzed_by RID
64 :     catalyzes role
65 :     cgi_url
66 :     clean_tmp
67 :     close_genes FID distance
68 :     comp2react CID
69 :     contig_ln GenomeID Contig
70 :     coupling_and_evidence FID Bound SimCutoff CouplingCutoff
71 :     crude_estimate_of_distance GenomeID1 GenomeID2
72 :     delete_all_peg_links PEG
73 :     delete_genomes G1 G2 G3 ...Gn
74 :     delete_peg_link PEG URL
75 :     displayable_reaction RID
76 :     dna_seq GenomeID Loc
77 :     dsims FastaFile [MaxN MaxPsc Select] *** select defaults to raw
78 :     ec_to_maps EC
79 :     ec_name EC
80 :     expand_ec EC
81 :     epoch_to_readable [gives readable version of current time]
82 :     export_chromosomal_clusters
83 :     export_pch_pins
84 :     export_set Relation SetName File
85 :     exportable_subsystem Subsystem
86 :     external_calls PEG1 PEG2 PEG3...
87 :     extract_seq ContigsFile loc
88 :     all_exchangable_subsystems
89 :     best_bbh_candidates Genome Cutoff Requested Known
90 :     family_function Family
91 :     fast_coupling FID Bound CouplingCutoff
92 :     feature_aliases FID
93 :     feature_annotations FID
94 :     feature_location FID
95 :     file2N File
96 :     ftype FID
97 :     function_of ID [all] [trans]
98 :     genes_in_region GenomeID Contig Beg End
99 :     genome_of PEG
100 :     genomes [complete|Pat]
101 :     genome_counts [complete]
102 :     genome_version GenomeID
103 :     genus_species GenomeID
104 :     get_translation ID
105 :     get_translations File [tab]
106 :     h [command] *** h h for help on how to use help ***
107 :     hypo Function
108 :     ids_in_family Family
109 :     ids_in_set WhichSet Relation SetName
110 :     in_cluster_with PEG
111 :     in_family FID
112 :     in_pch_pin_with FID
113 :     in_sets Which Relation SetName
114 :     is_archaeal GenomeID
115 :     is_bacterial GenomeID
116 :     is_eukaryotic GenomeID
117 :     is_prokaryotic GenomeID
118 :     is_exchangable_subsystem Subsystem
119 :     is_real_feature FID
120 :     largest_clusters FileOfRoles user
121 :     load_all
122 :     map_to_ecs map
123 :     map_name map
124 :     mapped_prot_ids ID
125 :     maps_to_id ID
126 :     max n1 n2 n3 ...
127 :     merged_related_annotations PEG1 PEG2 ... PEGn
128 :     min n1 n2 n3 ...
129 :     names_of_compound
130 :     neighborhood_of_role role
131 :     org_of prot_id
132 :     pegs_of GenomeID
133 :     possibly_truncated FID
134 :     reaction2comp RID
135 :     related_by_func_sim PEG user
136 :     reversible RID
137 :     rnas_of GenomeID
138 :     roles_of_function function
139 :     same_func 'function1' 'function2'
140 :     search_index pattern
141 :     seqs_with_role role who
142 :     seqs_with_roles_in_genomes genomes roles made_by
143 :     sims PEG maxN maxP select
144 :     sort_fids_by_taxonomy FID1 FID2 FID3 ...
145 :     sort_genomes_by_taxonomy GenomeID1 GenomeID2 GenomeID3 ...
146 :     sz_family family
147 :     taxonomic_groups_of_complete min_for_labels
148 :     taxonomy_of GenomeID
149 :     translatable prot_id
150 :     translate_function PEG user
151 :     translated_function_of PEG user
152 :     translation_length ID
153 :     unique_functions PEGs user [make the pegs comma separated]
154 :     verify_dir dir
155 :    
156 :     </pre>
157 :     If you quickly browse through the commands that are supported by <b>fig</b>,
158 :     you should get a fairly complete overview of the services provided by the API.
159 :     We are not going to cover the use of <b>fig</b> in detail in this
160 :     tutorial, but we do encourage you to experiment with it. By the way,
161 :     you should type <b>x</b> to exit from it.
162 :     <p>
163 :     <b>fig</b> is ususally accessed in interactive mode (invoked with no
164 :     arguments). However, if you wish to invoke it to process a single
165 :     command, you can just put the command as a single argument on the
166 :     command-line. Thus,
167 :     <pre>
168 :     fig 'add_peg_link fig|188.1.peg.96 &lt;a href=http://nature.berkeley.edu/~hongwang/Project/ATP_synthase/&gt;animation&lt;/a&gt;'
169 :     </pre>
170 :     would add a link to PEG <i>fig|188.1.peg.96</i> while
171 :     <pre>
172 :     fig 'delete_all_peg_links fig|188.1.peg.96'
173 :     </pre>
174 :     would delete all links from the PEG. Since adding links to PEGs is
175 :     something you might wish to do, the easiest way to do it is to just
176 :     create a file of commands of the sort shown, make the file executable,
177 :     and then run it.
178 :     <p>
179 :     If you are a Perl programmer, eventually you will wish to peruse the
180 :     <b>fig</b> source code to see how it is structured, how to add to it,
181 :     and how to invoke services via the API.
182 :     <p>
183 :     We have briefly discussed the utility <b>fig</b>. Now it is time to
184 :     discuss writing your own programs using the API. Basically, the
185 :     entire API is encapsulated in the Perl module <b>FIG.pm</b>. To illustrate
186 :     the style of programming we recommend, let's create a simple program
187 :     to add a link to a PEG (rather than using <b>fig</b>).
188 :     Adding a link
189 :     to a PEG will cause the link to be displayed whenever anyone examines
190 :     the PEG via the SEED web services..
191 :     The API defines a set of services that do many things; adding a link
192 :     to a PEG is just one very minor service, but understanding exactly how one
193 :     could invoke such a service is what this tutorial is about.
194 :     To begin, use
195 :     <pre>
196 :     tool_hdr add_link
197 :     </pre>
198 :     This creates a file called <i>add_link</i> and makes it executable.
199 :     The file contains an initial set of Perl commands designed to
200 :     establish the correct execution environment (where the libraries are,
201 :     and so forth). You need to add your code to the end of the
202 :     initialized file. We recommend adding the following lines:
203 :     <hr>
204 :     <pre>
205 :     use FIG;
206 :     my $fig = new FIG;
207 :    
208 :     $usage = "usage: add_link PEG Link";
209 :    
210 :     (
211 :     ($peg = shift @ARGV) &&
212 :     ($link = shift @ARGV)
213 :     )
214 :     || die $usage;
215 :    
216 :     $fig->add_peg_links([$peg,$link]);
217 :     </pre>
218 :     <h3>A short program to add a link to a PEG</h3>
219 :     <hr>
220 :     We urge you to study this short program carefully, since it reflects
221 :     the style we will use throughout this tutorial. There are a number of
222 :     things worth noting:
223 :     <ul>
224 :     <li>
225 :     The lines
226 :     <pre>
227 :     use FIG;
228 :     my $fig = new FIG;
229 :     </pre>
230 :     initialize <i>$fig</i>. This is your key to the fig services. All
231 :     API services are invoked using
232 :     <pre>
233 :     $fig->rtn(arguments)
234 :     </pre>
235 :     <li>
236 :     The last line of the program is the actual invocation of the service
237 :     to add a link to a peg. Actually, <i>add_peg_links</i> is a service
238 :     routine that takes any number of arguments. Each argument is a
239 :     2-tuple (i.e., a pointer to a list containing two elements). Each
240 :     2-tuple contains a PEG and a link. Links should be of the form:
241 :     <pre>
242 :     &lt;a href=URL&gt;LABEL&lt;/a&gt;
243 :     </pre>
244 :     </ul>
245 : olson 1.2
246 :     We suggest that you type in this program and get it to run on your
247 : overbeek 1.1 version of the SEED. Then, use <b>fig</b> to delete any links you
248 :     added.
249 : olson 1.2
250 : overbeek 1.1 <h2>Accessing Data Relating to Genomes</h2>
251 : olson 1.2
252 : overbeek 1.1 Let us now begin a somewhat more systematic exploration of the services offered by the API.
253 :     We will begin with genome-related services, of which there are relatively few in the SEED.
254 :     Essentially, genomes are represented by IDs of the form <i>xxx.y</i>
255 :     where <i>xxx</i> is the NCBI taxon ID, and <i>y</i> is a suffix to
256 :     disambiguate situations in which you have multiple genomes with the
257 :     same taxon ID.
258 : olson 1.2 <p>
259 :     <blockquote>
260 :     <table border="1">
261 :     <tr>
262 :     <td>genomes.ex.pl</td>
263 :     <td>genomes.ex.py</td>
264 :     </tr>
265 :     <tr>
266 :     <td bgcolor="f0f0f0"><pre>
267 : overbeek 1.1
268 :     use strict;
269 :     use FIG;
270 :     my $fig = new FIG;
271 :    
272 :     my(%sofar,$genome,$which,$abbrev,$version);
273 :    
274 :     foreach $genome ($fig->genomes("complete"))
275 :     {
276 :     $which = $fig->genus_species($genome);
277 :     $abbrev = &get_unique_abbreviation($which,\%sofar);
278 :     $version = $fig->genome_version($genome);
279 :     print "$genome\t$abbrev\t$version\t$which\n";
280 :     }
281 :    
282 :     sub get_unique_abbreviation {
283 :     my($which,$sofar) = @_;
284 :     my($nonunique,$n);
285 :    
286 :     $nonunique = &FIG::abbrev($which);
287 :     $nonunique =~ s/ //g;
288 :     $nonunique =~ s/\.+$//;
289 :     $n = $sofar->{$nonunique};
290 :     if (! defined($n))
291 :     {
292 :     $n = $sofar->{$nonunique} = 1;
293 :     }
294 :     $sofar->{$nonunique}++;
295 :     return $nonunique . ".$n";
296 :     }
297 : olson 1.2 </pre></td>
298 :     <td bgcolor="f0f0f0"><pre>
299 :     from FigKernelPackages import FIG
300 :     import sys
301 :     import string
302 :    
303 :    
304 :     def die(msg):
305 :     print msg
306 :     sys.exit(0)
307 :    
308 :     fig = FIG.FIG()
309 :     sofar = {}
310 :    
311 :     def get_unique_abbreviation(which, sofar):
312 :     nonunique = fig.abbrev(which);
313 :     nonunique = string.rstrip(string.replace(nonunique," ", ""), ".")
314 :     if not sofar.has_key(nonunique):
315 :     sofar[nonunique] = 1
316 :     n = sofar[nonunique]
317 :     sofar[nonunique] = sofar[nonunique] + 1
318 :     return (nonunique+"."+str(n))
319 :    
320 :     for genome in fig.genomes("complete"):
321 :     which = fig.genus_species(genome)
322 :     abbrev = get_unique_abbreviation(which,sofar);
323 :     version = fig.genome_version(genome);
324 :     print "%s\t%s\t%s\t%s" % (genome, abbrev, version, which)
325 :    
326 :     </pre></td>
327 :     </tr>
328 :     </table>
329 :     </blockquote>
330 :    
331 : overbeek 1.1 <h3>An illustration of genome-related services</h3>
332 : olson 1.2
333 : overbeek 1.1 <hr>
334 :     There are a few points worth noting about the short example:
335 :     <ul>
336 :     <li>
337 : olson 1.2
338 : overbeek 1.1 <i>$fig->genomes</i> returns a list of all of the genome IDs. It has
339 :     two arguments that can be omitted. The first is used to request only
340 :     complete (or near complete) genomes (this is used in the example).
341 :     For each genome in the SEED there is a directory in
342 :     <i>$FIG_Config::organisms</i> that encapsulates data about the genome.
343 :     Genomes will be considered "complete" iff there exists a file
344 :     <i>$FIG_Config::organisms/$genome/COMPLETE</i> (where $genome contains the
345 :     genome ID). The second argument can be used to restrict the returned
346 :     list of genomes to those that have restrictions on distribution
347 :     (indicated by the presence of
348 :     <i>$FIG_Config::organisms/$genome/RESTRICTIONS</i>). Thus,
349 :     <i>$fig->genomes("complete","restricted")</i> could be used to get a
350 :     list of complete, but restricted, genomes.
351 :     <li>
352 :     There are a number of routines contained in <i>FIG.pm</i> which are
353 :     not really services, but are there because they are simply offer
354 :     convenient subroutines. They probably should not be part of an API.
355 :     For example <i>&FIG::between($x,$y,$z)</i> can be invoked to return
356 :     "true" iff <i>$y</i> is between <i>$x</i> and <i>$z</i>.
357 :     <i>&FIG::abbrev($which)</i> invokes a simple routine that produces an
358 :     abbrevbiation of a genome description. It will usually contain
359 :     spaces and sometimes trailing periods, so if you want an abbreviation without these, you have to get
360 :     rid of them. Further, the abbreviations are not unique (that is,
361 :     invoking this routine for two distinct genomes can produce the same
362 :     result). Hence, we give a simple little routine that produces unique
363 :     abbreviations (suitable, for example, to use for sequences in an
364 :     alignment).
365 :     <li>
366 :     Finally, it is worth noting that, as of now, we think of a genome
367 :     "version" as composed of a genome ID concatenated with a checksum.
368 :     This may or may not change in the future.
369 :     </ul>
370 :     Here are just a few lines from the output of this little program:
371 :     <pre>
372 :     197221.1 The.elon.BP.1 197221.1.3406724106 Thermosynechococcus elongatus BP-1
373 :     198094.1 Bac.anth.st.2 198094.1.2593352402 Bacillus anthracis str. Ames
374 :     198214.1 Shi.flex.2a.1 198214.1.1464268764 Shigella flexneri 2a str. 301
375 :     198215.1 Shi.flex.2a.2 198215.1.1756034760 Shigella flexneri 2a str. 2457T
376 :     </pre>
377 :     What other genome-related services does the SEED offer? The only ones
378 :     worth mentioning here relate to <i>contigs</i>. Here is a short
379 :     example illustrating how to use those services. Basically, you can
380 :     <ol>
381 :     <li>
382 :     get a list of the contigs in a given genome,
383 :     <li>
384 :     get the length of a specified contig, and
385 :     <li>
386 :     extract sequence from a contig.
387 :     </ol>
388 :     To illustrate these features, here is a short program that list all of
389 :     the contigs for a designated genome. For each contig that is over 50
390 :     characters in size (and, they all should be), it displays the contig
391 :     ID, the contig length, and the first 50 characters of sequence.
392 : olson 1.2 <p>
393 :     <blockquote>
394 :     <table border="1">
395 :     <tr>
396 :     <td>contigs_for_genome.pl</td>
397 :     <td>contigs_for_genome.py</td>
398 :     </tr>
399 :     <tr>
400 :     <td bgcolor="f0f0f0"><pre>
401 : overbeek 1.1
402 :     use strict;
403 :     use FIG;
404 :     my $fig = new FIG;
405 :    
406 :     my $usage = "usage: contigs_for_genome Genome";
407 :    
408 :     my($genome,@contigs,$contig,$len,$seq);
409 :    
410 :     ($genome = shift @ARGV)
411 :     || die $usage;
412 :    
413 :     @contigs = $fig->all_contigs($genome);
414 :     foreach $contig (@contigs)
415 :     {
416 :     $len = $fig->contig_ln($genome,$contig);
417 :     if ($len >= 50)
418 :     {
419 :     $seq = $fig->dna_seq($genome,$contig . "_1_50");
420 :     print "$contig\t$len\t$seq\n";
421 :     }
422 :     }
423 :    
424 : olson 1.2 </pre></td>
425 :     <td bgcolor="f0f0f0"><pre>
426 :     from FigKernelPackages import FIG
427 :     import sys
428 :    
429 :     fig = FIG.FIG()
430 :    
431 :     def die(msg):
432 :     print msg
433 :     sys.exit(0)
434 :    
435 :     if len(sys.argv) != 2:
436 :     die("usage: contigs_for_genome Genome")
437 :    
438 :     genome = sys.argv[1]
439 :    
440 :     contigs = fig.all_contigs(genome)
441 :    
442 :     for contig in contigs:
443 :     len = fig.contig_ln(genome,contig)
444 :     if len >= 50:
445 :     seq = fig.dna_seq(genome,contig+"_1_50")
446 :     print contig+"\t"+len+"\t"+seq
447 :    
448 :     </pre></td>
449 :     </tr>
450 :     </table>
451 :     </blockquote>
452 :    
453 : overbeek 1.1 <h3>How to access services related to contigs</h3>
454 :     <hr>
455 :     The services being used in this short example are as follows:
456 :     <ul>
457 :     <li>
458 :     <i>$fig->all_contigs($genome)</i> returns a list of the contigs that
459 :     make up the designated genome,
460 :     <li>
461 :     <i>$fig->contig_ln($genome,$contig)</i> returns the length of the
462 :     designated contig, and
463 :     <li>
464 :     <i>$fig->dna_seq($genome,$contig . "_1_50")</i> returns the first 50
465 :     characters of the contig.
466 :     </ul>
467 :     This last service (<i>dna_seq</i>) requires some explanation. You
468 :     need to know that
469 :     <ol>
470 :     <li>
471 :     Locations on a contig begin with 1 (not 0 as some computer scientists
472 :     would prefer).
473 :     <li>
474 :     Within the SEED, locations are represented as a comma-separated list
475 :     of <i>contiguous regions</i>. A contiguous region is represented by a
476 :     contig ID, a beginning
477 :     position, and an ending position all separated by underscores. One
478 :     fact that can be a little confusing is that contig IDs themselves may
479 :     contain underscores. Thus,
480 :     <pre>
481 :     NC_000913_1_50
482 :     </pre>
483 :     would be used to represent the first fifty chracters of the contig
484 :     NC_000913.
485 :     <li>
486 :     Sections of sequence on the complementary strand are represented
487 :     similarly (with the beginning point greater than the ending
488 :     position). Thus,
489 :     <pre>
490 :     NC_000913_50_1
491 :     </pre>
492 :     would represent the sequence on the complementary strand of NC_000913
493 :     going from position 50 to position 1.
494 :     <li>
495 :     The routine <i>dna_seq</i> will accept a location (i.e., a
496 :     comma-separated list of contiguous regions), or even just a list of
497 :     arguments each representing a contiguous region, and return
498 :     the concatenation of the sequences from those locations. Thus,
499 :     <pre>
500 :     $fig->dna_seq($genome,"NC_000913_1_50,NC_000913_52_100")
501 :     or
502 :     $fig->dna_seq($genome,"NC_000913_1_50","NC_000913_52_100")
503 :     </pre>
504 :     would return 99 characters formed by deleting character 51 from the
505 :     first 100 characters of NC_000913.
506 :     </ol>
507 :    
508 :     <h2>Accessing Data Relating to Features</h2>
509 :    
510 :     The SEED was designed, like WIT and ERGO before it, to allow a variety
511 :     of types of features. In fact, the SEED is essentially focused on
512 :     protein-encoding genes, which I perversely named <i>pegs</i> rather
513 :     than <i>CDSs</i>. It is true that WIT contained RNAs, inteins,
514 :     introns, and repeat sequences, and ERGO contained RNAs and IS
515 :     elements; however, the SEED really only has pegs and RNAs at this
516 :     point (and, the RNAs are not adequately handled).
517 :     <p>
518 :     There is a fairly rich set of services for pegs, as you might guess
519 :     from the web services offered by the SEED. As an introduction to some
520 :     of these capabilities, consider the following short program:
521 :    
522 : olson 1.2 <p>
523 :     <blockquote>
524 :     <table border="1">
525 :     <tr>
526 :     <td>features_around.pl</td>
527 :     <td>features_around.py</td>
528 :     </tr>
529 :     <tr>
530 :     <td bgcolor="f0f0f0"><pre>
531 :    
532 : overbeek 1.1 use strict;
533 :     my($usage,$id,$fid,$peg,$loc,$contig,$beg,$end,$start_reg,$end_reg);
534 :     my($loc1,$aliases1,$trunc,$pseq,$prot_ln,$func,$features_in_region);
535 :     my($start_of_leftmost_feature,$end_of_rightmost_feature);
536 :     my($genome);
537 :    
538 :     use FIG;
539 :     my $fig = new FIG;
540 :    
541 :     $usage = "usage: features_around ID";
542 :    
543 :     ($id = shift @ARGV)
544 :     || die $usage;
545 :    
546 :     if ($peg = $fig->by_alias($id))
547 :     {
548 :     if ($loc = $fig->feature_location($peg))
549 :     {
550 :     ($contig,$beg,$end) = $fig->boundaries_of($loc);
551 :     if (defined($contig))
552 :     {
553 :     $start_reg = &FIG::min($beg,$end) - 5000;
554 :     $end_reg = &FIG::max($beg,$end) + 5000;
555 :     $genome = &FIG::genome_of($peg);
556 :     ($features_in_region,$start_of_leftmost_feature,$end_of_rightmost_feature) =
557 :     $fig->genes_in_region($genome,$contig,$start_reg,$end_reg);
558 :     foreach $fid (@$features_in_region)
559 :     {
560 :     $loc1 = $fig->feature_location($fid);
561 :     $aliases1 = $fig->feature_aliases($fid);
562 :     $trunc = $fig->possibly_truncated($fid);
563 :    
564 :     $pseq = $func = $prot_ln = "";
565 :    
566 :     if (&FIG::ftype($fid) eq "peg")
567 :     {
568 :     if ($prot_ln = $fig->translation_length($fid))
569 :     {
570 :     $pseq = $fig->get_translation($fid);
571 :     $func = $fig->function_of($fid);
572 :     }
573 :     }
574 :     print join("\t",($fid,$loc1,$aliases1,$trunc,$func,$prot_ln,$pseq)),"\n";
575 :     }
576 :     }
577 :     }
578 :     else
579 :     {
580 :     print STDERR "Sorry, could not get the location of $fid\n";
581 :     }
582 :     }
583 :     else
584 :     {
585 :     print STDERR "Sorry, could not figure out which PEG you meant by $id\n";
586 :     }
587 : olson 1.2 </pre></td>
588 :     <td bgcolor="f0f0f0"><pre>
589 :     from FigKernelPackages import FIG
590 :     import sys
591 :     import string
592 :    
593 :     fig = FIG.FIG()
594 :    
595 :     def die(msg):
596 :     print msg
597 :     sys.exit(0)
598 :    
599 :     if len(sys.argv) != 2:
600 :     die("usage: features_around ID")
601 :    
602 :     id = sys.argv[1]
603 :    
604 :     peg = fig.by_alias(id)
605 :     if peg:
606 :     loc = fig.feature_location(peg)
607 :     if loc:
608 :     try:
609 :     (contig,beg,end) = fig.boundaries_of(loc)
610 :     start_reg = int(fig.min(beg,end)) - 5000
611 :     end_reg = int(fig.max(beg,end)) + 5000
612 :     genome = fig.genome_of(peg)
613 :     (features_in_region,start_of_leftmost_feature,end_of_rightmost_feature) = fig.genes_in_region(genome,contig,start_reg,end_reg);
614 :     for fid in features_in_region:
615 :     loc1 = fig.feature_location(fid)
616 :     aliases1 = fig.feature_aliases(fid)
617 :     trunc = fig.possibly_truncated(fid)
618 :    
619 :     pseq = func = prot_ln = ""
620 :    
621 :     if fig.ftype(fid) == "peg":
622 :     prot_ln = fig.translation_length(fid)
623 :     if prot_ln:
624 :     pseq = fig.get_translation(fid)
625 :     func = fig.function_of(fid)
626 :     print string.join([fid, loc1, str(aliases1), str(trunc), str(func), str(prot_ln), pseq], "\t")
627 :     #print join("\t",($fid,$loc1,$aliases1,$trunc,$func,$prot_ln,$pseq)),"\n";
628 :     except:
629 :     pass
630 :     else:
631 :     sys.stderr.write("Sorry, could not get the location of %s\n" % fid)
632 :     #print STDERR "Sorry, could not get the location of $fid\n";
633 :     else:
634 :     sys.stderr.write("Sorry, could not figure out which PEG you meant by %s\n" % id)
635 :     #print STDERR "Sorry, could not figure out which PEG you meant by $id\n";
636 :     </pre></td>
637 :     </tr>
638 :     </table>
639 :     </blockquote>
640 : overbeek 1.1 <h3>Displaying features close to a given feature</h3>
641 :     <hr>
642 :     This short program takes as an argument an ID that corresponds to a
643 :     feature. The ID can be a FIG feature ID of the form
644 :     <pre>
645 :     fig|xxxx.y.peg.zzzz
646 :     </pre>
647 :     where <i>xxxx.y</i> is the genome ID and <i>zzzz</i> is a unique
648 :     sequence number (these usually occur in the same order as the genes
649 :     occur on the contigs, but this is not guaranteed to be true). The
650 :     argument could also be any other ID known by the SEED that uniquely
651 :     corresponds to the FIG ID. For example,
652 :     <pre>
653 :     features_around 'sp|Q9TL34'
654 :     </pre>
655 :    
656 :     would display twelve features (<i>fig|31312.1.peg.1</i> through
657 :     <i>fig|31312.1.peg.12</i>, since <i>sp|Q9TL34</i> corresponds to
658 :     <i>fig|31312.1.peg.5</i>. The program outputs a tab-separated file
659 :     of the sort one might use as input to a spreadsheet like Excel. The
660 :     columns in the output contain:
661 :    
662 :     <ol>
663 :     <li>
664 :     the FIG ID,
665 :     <li>
666 :     the location of the feature,
667 :     <li>
668 :     a list of aliases for the gene,
669 :     <li>
670 :     a flag indicating whether or not the gene occurs near the end of the
671 :     contig (and, hence, is potentially truncated),
672 :     <li>
673 :     the function currently assigned to the gene,
674 :     <li>
675 :     the length of the protein sequence encoded by the gene, and
676 :     <li>
677 :     the protein sequence encoded by the gene.
678 :     </ol>
679 :     Of course, some of the features might not be PEGs, and in that case
680 :     the last three fields are just empty.
681 :     <p>
682 :     Now, let's go through some of the details in the code:
683 :     <ul>
684 :     <li>
685 :     The programs begins by trying to locate the FIG ID corresponding to
686 :     the ID given as the command-line argument. The expression
687 :     <pre>
688 :     $fig->by_alias($id)
689 :     </pre>
690 :     returns the FIG ID when it can be determined (otherwise,
691 :     <i>undef</i>).
692 :     <li>
693 :     Once the FIG ID for the PEG has been determined, the location is
694 :     requested using
695 :     <pre>
696 :     $fig->feature_location($peg)
697 :     </pre>
698 :     Remember, a location is a sequence of contiguous regions separated by
699 :     commas (often a single contiguous region); a contiguous region
700 :     amounts to a contig ID, a beginning position, and an ending position
701 :     separated by underscores. In some contexts, all you want is the
702 :     region containing a gene (without the details of intron/exon
703 :     boundaries). In this case, you can use
704 :     <pre>
705 :     ($contig,$beg,$end) = $fig->boundaries_of($loc);
706 :     </pre>
707 :     to convert the location to a single contiguous region. These will
708 :     succeed only if the sequence of contiguous regions all have the same
709 :     contig ID. <i>$beg</i> will be set to the start of the first exon,
710 :     while <i>$end</i> will be set to the end of the last exon.
711 :     <li>
712 :     Once the boundaries of the gene have been determined, the program asks
713 :     for all genes that overlap a region starting 5000 characters (bases,
714 :     if you like) to the left of the gene and ending 5000 characters to the
715 :     right of the gene. This is done by using
716 :     <pre>
717 :     $genome = &FIG::genome_of($peg);
718 :     </pre
719 :     to get the genome of the feature (<i>$fig->genome_of($peg)</i> would
720 :     work, as well), and then using
721 :     <pre>
722 :     ($features_in_region,$start_of_leftmost_feature,$end_of_rightmost_feature) =
723 :     $fig->genes_in_region($genome,$contig,$start_reg,$end_reg);
724 :     </pre>
725 :     to extract the set of genes that overlap the region specified by the
726 :     four arguments <i>$genome, $contig, $start_reg, </i>and
727 :     <i>$end_reg</i>. A list of three things is returned by the
728 :     invocation:
729 :     <ol>
730 :     <li>
731 :     <i>$features_in_region</i> will be set to a pointer to a list of the
732 :     features (represented by FIG IDs) that overlap the specified region.
733 :     <li>
734 :     <i>$start_of_leftmost_feature</i> will be set to the "leftmost"
735 :     position occupied by one of the features that is returned.
736 :    
737 :     <i>$end_of_rightmost_feature</i> will be set to the "rightmost"
738 :     position occupied by one of the features that is returned.
739 :     </ol>
740 :     Thus, if you were writing a graphics module to display the features,
741 :     you would need to show the region from
742 :     <i>$start_of_leftmost_feature</i> to <i>$end_of_rightmost_feature</i>
743 :     if you wished to capture the features without truncation.
744 :     <li>
745 :     Once the features have been acquired, the program has a loop that
746 :     simply extracts data about each feature and prints it.
747 :     <pre>
748 :     $loc1 = $fig->feature_location($fid);
749 :     $aliases1 = $fig->feature_aliases($fid);
750 :     $trunc = $fig->possibly_truncated($fid);
751 :     </pre>
752 :     shows how to get the location of a feature, a comma-separated list of
753 :     aliases, and an indication of whether or not the feature might be
754 :     truncated (this is a rather simple check to just see if the feature
755 :     occurs within a few hundred bases from the end of the contig).
756 :     The use of
757 :     <pre>
758 :     &FIG::ftype($fid)
759 :     </pre>
760 :     just extracts the type of the feature (which is almost always
761 :     <i>peg</i> or <i>rna</i>, although we certainly hope that we will have
762 :     a richer set of feature types in the not too distant future).
763 :     The two invocations
764 :     <pre>
765 :     $prot_ln = $fig->translation_length($fid)
766 :     $pseq = $fig->get_translation($fid)
767 :     </pre>
768 :     show how to get the length of the protein sequence encoded by a PEG
769 :     and the actual sequence itself.
770 :     Finally,
771 :     <pre>
772 :     $func = $fig->function_of($fid)
773 :     </pre>
774 :     is used to get the current master function assignment for the gene.
775 :     The SEED does support non-master assignments, although their use has
776 :     been discouraged. the original intent was to allow inexperienced
777 :     users to make assignments that did not overwrite the "master"
778 :     assignment. Users who identify themselves as <i>master:someone</i>
779 :     will only make assignments that overwrite the master assignment.
780 :     However, a user without the "master:" prefix can make an assignment
781 :     that is retained seprately. To access such an assignment, you would
782 :     use
783 :     <pre>
784 :     $func = $fig->function_of($fid,"someone")
785 :     </pre>
786 :     If <i>someone</i> has not made an assignment to <i>$fig</i>, this will
787 :     return the master assignment.
788 :     </ul>
789 :    
790 :     The SEED comes with a directory containing the example programs we are
791 :     displaying. If you find it hard to follow my comments or feel that I
792 :     have failed left out some detail that you need to understand, please
793 :     <ol>
794 :     <li>
795 :     contact me (Ross@TheFIG.info), and
796 :     <li>
797 :     copy the program and play with it. The most basic form of playing is
798 :     to just do things like inserting
799 :     <pre>
800 :     print &Dumper($some_returned_variable);
801 :     </pre>
802 :     to see precisely what comes back.
803 :     </ol>
804 :    
805 :     <h2>Sequence Data</h2>
806 :    
807 :     You have already been introduced to sequence data. You know how to
808 :     access the DNA from a location in a contig from a designated genome,
809 :     you know you to get locations of features, and you know how to acquire
810 :     the translations for PEGs. In this short section we discuss just a
811 :     few utilities that you might find useful.
812 :     <p>
813 :     The next short example illustrates how to access the reverse
814 :     complement of a DNA sequence (i.e., the opposite strand), how to
815 :     translate DNA sequences, and how to write out sequences in FASTA
816 :     format.
817 :     Let us look at it before thinking about additional complexities like
818 :     how to handle nonstandard translations or properly translating the
819 :     first codon:
820 :    
821 : olson 1.2 <p>
822 :     <blockquote>
823 :     <table border="1">
824 :     <tr>
825 :     <td>seq_utils.pl</td>
826 :     <td>seq_utils.py</td>
827 :     </tr>
828 :     <tr>
829 :     <td bgcolor="f0f0f0"><pre>
830 :    
831 : overbeek 1.1 use FIG;
832 :     my $fig = new FIG;
833 :    
834 : olson 1.2 # seq_utils 31312.1 NC_000927_9431_9048 > stdout 2> stderr
835 :    
836 : overbeek 1.1 $usage = "usage: seq_utils Genome Location";
837 :    
838 :     (
839 :     ($genome = shift @ARGV) &&
840 :     ($loc = shift @ARGV)
841 :     )
842 :     || die $usage;
843 :    
844 :     if ($seq = $fig->dna_seq($genome,$loc))
845 :     {
846 :     $seqR = &FIG::reverse_comp($seq);
847 :    
848 :     $tran = &FIG::translate($seq);
849 :     $tranR = &FIG::translate($seqR);
850 :    
851 :     &FIG::display_id_and_seq("plus_strand",\$seq);
852 :     &FIG::display_id_and_seq("plus_strand_translated",\$tran,\*STDERR);
853 :    
854 :     &FIG::display_id_and_seq("minus_strand",\$seqR);
855 :     &FIG::display_id_and_seq("minus_strand_translated",\$tranR,\*STDERR);
856 :     }
857 :    
858 : olson 1.2 </pre></td>
859 :     <td bgcolor="f0f0f0"><pre>
860 :     from FigKernelPackages import FIG
861 :     import sys
862 :    
863 :     fig = FIG.FIG()
864 :    
865 :     def die(msg):
866 :     print msg
867 :     sys.exit(0)
868 :    
869 :     if len(sys.argv) != 3:
870 :     die("usage: seq_utils Genome Location")
871 :    
872 :     genome = sys.argv[1]
873 :     loc = sys.argv[2]
874 :    
875 :     seq = fig.dna_seq(genome, loc)
876 :    
877 :     if seq:
878 :     seqR = fig.reverse_comp(seq);
879 :    
880 :     tran = fig.translate(seq);
881 :     tranR = fig.translate(seqR);
882 :    
883 :     fig.display_id_and_seq("plus_strand",seq);
884 :     #fig.display_id_and_seq("plus_strand_translated",\$tran,\*STDERR);
885 :    
886 :     fig.display_id_and_seq("minus_strand",seqR);
887 :     #fig.display_id_and_seq("minus_strand_translated",\$tranR,\*STDERR);
888 :     </pre></td>
889 :     </tr>
890 :     </table>
891 :     </blockquote>
892 :    
893 : overbeek 1.1 <h3>Using some simple sequence utilities</h3>
894 :     <hr>
895 :     The program takes two command-line arguments: a
896 :     genome ID and a location.
897 :     The program is supposed to write out the DNA from both the plus and
898 :     minus strands from the given location to STDOUT, and it writes the
899 :     translations of both strands to STDERR.
900 :     Thus,
901 :     <pre>
902 :     seq_utils 31312.1 NC_000927_9431_9048 > stdout 2> stderr
903 :     </pre>
904 :     will write the following contents to the file <i>stdout</i>:
905 :     <pre>
906 :     >plus_strand
907 :     ATGTCTACTATTACAAATCAAATTATTGAACAACTAAAATCGATGACGCTTTTGGAAGCG
908 :     GCTGAACTTGTGTCTCAAATTGAAGAGACATTTGGAGTTGATGCTTCTGCACCCGCAGCT
909 :     GGAGCGGTGATGGTCGCTGCCGCTGCGCCGTCTGCGCCAGTTGTCGAAGAGCAAACTGAA
910 :     TTTACTGTCATGATTAATGATGTGCCAAGTGCGAAACGAATTGCTGTTATTAAAGTTGTA
911 :     CGTGCTTTAACAGGATTAGGGTTGAAAGAAGCCAAAGATATGATTGAGTCTGTACCAAAA
912 :     GCAATTAAAGAAGGAATTGGGAAAGACGAAGCTCAACAAATTAAACAACAGCTCGAAGAG
913 :     GCTGGTGCAACTGTCACAGTGAAA
914 :     >minus_strand
915 :     TTTCACTGTGACAGTTGCACCAGCCTCTTCGAGCTGTTGTTTAATTTGTTGAGCTTCGTC
916 :     TTTCCCAATTCCTTCTTTAATTGCTTTTGGTACAGACTCAATCATATCTTTGGCTTCTTT
917 :     CAACCCTAATCCTGTTAAAGCACGTACAACTTTAATAACAGCAATTCGTTTCGCACTTGG
918 :     CACATCATTAATCATGACAGTAAATTCAGTTTGCTCTTCGACAACTGGCGCAGACGGCGC
919 :     AGCGGCAGCGACCATCACCGCTCCAGCTGCGGGTGCAGAAGCATCAACTCCAAATGTCTC
920 :     TTCAATTTGAGACACAAGTTCAGCCGCTTCCAAAAGCGTCATCGATTTTAGTTGTTCAAT
921 :     AATTTGATTTGTAATAGTAGACAT
922 :     </pre>
923 :     and the following two sequences to the file <i>stderr</i>
924 :     <pre>
925 :     >plus_strand_translated
926 :     MSTITNQIIEQLKSMTLLEAAELVSQIEETFGVDASAPAAGAVMVAAAAPSAPVVEEQTE
927 :     FTVMINDVPSAKRIAVIKVVRALTGLGLKEAKDMIESVPKAIKEGIGKDEAQQIKQQLEE
928 :     AGATVTVK
929 :     >minus_strand_translated
930 :     FHCDSCTSLFELLFNLLSFVFPNSFFNCFWYRLNHIFGFFQP*SC*STYNFNNSNSFRTW
931 :     HIINHDSKFSLLFDNWRRRRSGSDHHRSSCGCRSINSKCLFNLRHKFSRFQKRHRF*LFN
932 :     NLICNSRH
933 :     </pre>
934 :     Now let us look at the routines within the program that produced these
935 :     results:
936 :     <ul>
937 :     <li>
938 :     You have already seen the use of something like
939 :     <pre>
940 :     $seq = $fig->dna_seq($genome,$loc)
941 :     </pre>
942 :     to extract the DNA. The actual location can specify sequence from
943 :     either strand (remember how?).
944 :     <li>
945 :     The we use
946 :     <pre>
947 :     $seqR = &FIG::reverse_comp($seq)
948 :     </pre>
949 :     to get the reverse complement of the sequence held in <i>$seq</i>.
950 :     <li>
951 :     The program uses
952 :     <pre>
953 :     $tran = &FIG::translate($seq);
954 :     $tranR = &FIG::translate($seqR);
955 :     </pre>
956 :     to get translations of both strands. Note that these invocations use
957 :     the standard genetic code to do the translation. The <i>translate</i>
958 :     service supports two optional arguments. The second argument is used
959 :     to pass a pointer to a hash containing the desired translation, if you wish to use
960 :     a nonstandard code. The keys of the hash should be
961 :    
962 :     <pre>
963 :     AAA =>
964 :     AAC =>
965 :     AAG =>
966 :     .
967 :     .
968 :     .
969 :     TTT =>
970 :     </pre>
971 :     The values are what the codons are translated to (normally, from the
972 :     set {A,C,G,T}, but it is essentially unrestricted). If the third
973 :     argument evaluates to <i>true</i> (i.e., defined, not null string, and
974 :     not 0), then initial GTG and TTG codons will be translated to "M".
975 :     <li>
976 :     The final little utility we employ is used to write the FASTA files:
977 :     <pre>
978 :     &FIG::display_id_and_seq("plus_strand",\$seq);
979 :     &FIG::display_id_and_seq("plus_strand_translated",\$tran,\*STDERR);
980 :    
981 :     &FIG::display_id_and_seq("minus_strand",\$seqR);
982 :     &FIG::display_id_and_seq("minus_strand_translated",\$tranR,\*STDERR);
983 :     </pre>
984 :     The first argument is the <i>id</i> you wish, the second is a pointer
985 :     to the sequence to write out, and the third argument (which is
986 :     optional) is a file handle. It defaults to writing to STDOUT.
987 :    
988 :     <h2>Accessing Similarities</h2>
989 :    
990 :     A great deal of comparative analysis relates to use of detected
991 :     similarities. The SEED includes precomputed similarities for each
992 :     protein sequence. Basically these similarities are computed as
993 :     follows:
994 :     <ol>
995 :     <li>
996 :     First, a nonredundant protein sequence data base is formed by taking
997 :     all protein sequences from a number of sources, including the
998 :     organisms loaded into the SEED (but, including sequences from external
999 :     sources that are not found in the SEED organisms). Sets of protein
1000 :     sequences that differ only in start calls are collapsed into single
1001 :     entries in the nonredundant database (the longer sequences are
1002 :     retained, and a table of synonyms is constructed showing which IDs
1003 :     correspond to each nonredundant entry, and which of the merged IDs
1004 :     actually represent shorter sequences.
1005 :     <li>
1006 :     Then all of the sequences in the nonredundant database are blasted
1007 :     against one another. For each sequence, there may be only a few or a
1008 :     great many similarities. We keep only the best similarity between any
1009 :     two sequences. Further, we set a maximum number of similarities
1010 :     retained for each sequence to "about 300". We try to keep a set that
1011 :     includes at least the best similarity between the given sequence and
1012 :     sequences from any specific organism. We do not guarantee that we even
1013 :     succeed in that. So, you should think of the precomputed similarities
1014 :     as a potentially useful, but quite truncated, set. We are moving
1015 :     towards a situation in which relationships between closely related sequences are saved in
1016 :     the form of trees (and the corresponding similarities are discarded).
1017 :     <li>
1018 :     When a user requests the similarities to a given PEG, first the SEED
1019 :     must figure out which entry in the nonredundant protein sequence
1020 :     collection corresponds to the PEG (it might actually be the
1021 :     translation corresponding to the PEG, or it might be another sequence
1022 :     ID corresponding to a longer sequence from, say, Swiss Prot). Then,
1023 :     the block of similarities corresponding to the representative ID is
1024 :     extracted. Each of the similarities in this set really represents a
1025 :     collection of similarities (one corresponding to each ID in the set of
1026 :     sequences that were collapsed into a single representative ID). For
1027 :     each of the "raw" similarities, you can choose to expand it into the
1028 :     entire set it represents or not, as you wish. The expansion actually
1029 :     takes much more time than actually extracting the block of
1030 :     similarities, which is an irritating fact (which may be correctable by
1031 :     some bright young computer scientists, but for now...). So, it is
1032 :     important that you understand the two concepts:
1033 :     <ul>
1034 :     <li>
1035 :     A <i>raw similarity</i> is a similarity between two members of the
1036 :     nonredundant protein collection.
1037 :     <li>
1038 :     The act of <i>expanding a similarity</i> amounts to taking a
1039 :     similarity between the given ID (ususally called <i>id1</i>) and
1040 :     another (usually called <i>id2</i>) and expanding it by considering
1041 :     all ids that were collapsed to <i>id2</i>.
1042 :     </ul>
1043 :     <li>
1044 :     Finally, you need to know that when you request similarities against a
1045 :     given id, you get to specify how many of the similarities you wish
1046 :     expanded. This is useful, since one might wish to see the expanded
1047 :     set for, say, the closest five, while getting several hundred
1048 :     unexpanded similarities.
1049 :     </ol>
1050 :     <br>
1051 :     Here is a short program that illustrates some of these comments. It
1052 :     takes as input arguments
1053 :     <ol>
1054 :     <li>
1055 :     a given PEG id,
1056 :     <li>
1057 :     a similarity threshhold (minimum p-score),
1058 :     <li>
1059 :     the maximum number of similarities to be returned, and
1060 :     <li>
1061 :     the number of similarities to expand.
1062 :     </ol>
1063 :     For each similarity, it displays the
1064 :     <ul>
1065 :     <li> the length of the given PEG,
1066 :     <li> id2 (the id of the similar sequence),
1067 :     <li> the length of id2,
1068 :     <li> the p-score,
1069 :     <li> the perecent identity,
1070 :     <li> the region of similarity, and
1071 :     <li> the function assigned to the similar sequence.
1072 :     </ul>
1073 :    
1074 : olson 1.2 <p>
1075 :     <blockquote>
1076 :     <table border="1">
1077 :     <tr>
1078 :     <td>show_similarities.pl</td>
1079 :     <td>show_similarities.py</td>
1080 :     </tr>
1081 :     <tr>
1082 :     <td bgcolor="f0f0f0"><pre>
1083 : overbeek 1.1 use FIG;
1084 :     my $fig = new FIG;
1085 :     use Sim;
1086 :    
1087 :     use strict;
1088 :    
1089 :     my($usage,$peg,$cutoff,$max,$expand);
1090 :     my(@sims,$sim,$id2,$pscore,$iden,$b1,$e1,$b2,$e2,$ln1,$ln2,$func1,$func2);
1091 :    
1092 :     $usage = "usage: show_similarities PEG CutOff MaxReturned Expand";
1093 :    
1094 :     (
1095 :     ($peg = shift @ARGV) &&
1096 :     ($cutoff = shift @ARGV) &&
1097 :     ($max = shift @ARGV) &&
1098 :     defined($expand = shift @ARGV)
1099 :     )
1100 :     || die $usage;
1101 :    
1102 :     @sims = $fig->sims($peg,$max,$cutoff,"all",$expand);
1103 :     foreach $sim (@sims)
1104 :     {
1105 :     $ln1 = $sim->ln1;
1106 :     $id2 = $sim->id2;
1107 :     $ln2 = $sim->ln2;
1108 :     $pscore = $sim->psc;
1109 :     $iden = $sim->iden;
1110 :     $b1 = $sim->b1;
1111 :     $e1 = $sim->e1;
1112 :     $b2 = $sim->b2;
1113 :     $e2 = $sim->e2;
1114 :     $func2 = $fig->function_of($id2);
1115 :     print join("\t",($ln1,$id2,$ln2,$pscore,$iden,$b1,$e1,$b2,$e2,$func2)),"\n";
1116 :     }
1117 : olson 1.2 </pre></td>
1118 :     <td bgcolor="f0f0f0"><pre>
1119 :     from FigKernelPackages import FIG
1120 :     import sys, string
1121 :    
1122 :     fig = FIG.FIG()
1123 :     id1_ix = 0
1124 :     ln1_ix = 12
1125 :     id2_ix = 1
1126 :     ln2_ix = 13
1127 :     iden_ix = 2
1128 :     psc_ix = 10
1129 :     ali_ln_ix = 3
1130 :     mismatches_ix = 4
1131 :     gaps_ix = 5
1132 :     bsc_ix = 11
1133 :     b1_ix = 6
1134 :     e1_ix = 7
1135 :     b2_ix = 8
1136 :     e2_ix = 9
1137 :     bit_score_ix = 11
1138 :     tool_ix = 14
1139 :     def2_ix = 15
1140 :     ali_ix = 16
1141 :    
1142 :    
1143 :     def die(msg):
1144 :     print msg
1145 :     sys.exit(0)
1146 :    
1147 :     if len(sys.argv) != 5:
1148 :     die("usage: show_similarities PEG CutOFF MaxReturned Expand")
1149 :    
1150 :    
1151 :     peg = sys.argv[1]
1152 :     cutoff = sys.argv[2]
1153 :     max = sys.argv[3]
1154 :     expand = sys.argv[4]
1155 :    
1156 :     sims = fig.sims(peg, max, cutoff, "all", expand)
1157 :    
1158 :     for sim in sims:
1159 :     ln1 = str(sim[ln1_ix])
1160 :     id2 = str(sim[id2_ix])
1161 :     ln2 = str(sim[ln2_ix])
1162 :     pscore = str(sim[psc_ix])
1163 :     iden = str(sim[iden_ix])
1164 :     b1 = str(sim[b1_ix])
1165 :     e1 = str(sim[e1_ix])
1166 :     b2 = str(sim[b2_ix])
1167 :     e2 = str(sim[e2_ix])
1168 :     func2 = fig.function_of(id2)
1169 :     print string.join((ln1, id2, ln2, pscore, iden, b1, e1, b2, e2, func2[1]), "\t")
1170 :     </pre></td>
1171 :     </tr>
1172 :     </table>
1173 :     </blockquote>
1174 : overbeek 1.1
1175 :     <h3>How to Access Similarities</h3>
1176 :     <hr>
1177 :     If youn were to run this program using
1178 :     <pre>
1179 :     show_similarities 'fig|83333.1.peg.3' 1.0e-5 500 0
1180 :     </pre>
1181 :     it would produce an output file that started with the following lines:
1182 :     <pre>
1183 :     310 fig|216592.1.peg.4423 310 4e-178 99.68 1 310 1 310 Homoserine kinase (EC 2.7.1.39)
1184 :     310 sp|Q8XA82 310 9e-178 99.35 1 310 1 310 Homoserine kinase (EC 2.7.1.39)
1185 :     310 pirnr|NF01089494 310 2e-177 99.35 1 310 1 310 Homoserine kinase
1186 :     310 pirnr|NF01137671 310 2e-177 99.35 1 310 1 310 Homoserine kinase (EC 2.7.1.39) (HK)
1187 :     310 fig|216599.1.peg.3713 310 3e-177 99.35 1 310 1 310 Homoserine kinase (EC 2.7.1.39)
1188 :     </pre>
1189 :     This output amounts to the first five unexpanded similarities. If you
1190 :     were to run
1191 :     <pre>
1192 :     show_similarities 'fig|83333.1.peg.3' 1.0e-5 500 2
1193 :     </pre>
1194 :     it would produce an output file that started with the following lines:
1195 :     <pre>
1196 :     310 sp|P00547 310 0 100 1 310 1 310 Homoserine kinase (EC 2.7.1.39)
1197 :     310 pirnr|NF00702903 310 0 100 1 310 1 310 Homoserine kinase (EC 2.7.1.39) (HK)
1198 :     310 kegg|eco:b0003 310 0 100 1 310 1 310 homoserine kinase (HK) [EC:2.7.1.39]
1199 :     310 kegg|ecj:JW0002 310 0 100 1 310 1 310 Homoserine kinase [EC:2.7.1.39]
1200 :     310 pirnr|NF01368612 310 0 100 1 310 -18 291 homoserine kinase
1201 :     310 gi|33383907 310 0 100 1 310 -18 291 homoserine kinase
1202 :     310 gi|16127997 310 0 100 1 310 1 310 homoserine kinase
1203 :     310 fig|216592.1.peg.4423 310 4e-178 99.68 1 310 1 310 Homoserine kinase (EC 2.7.1.39)
1204 :     310 fig|216593.1.peg.3578 310 4e-178 99.68 1 310 1 310 Homoserine kinase (EC 2.7.1.39)
1205 :     310 fig|216598.1.peg.2077 310 4e-178 99.68 1 310 1 310 Homoserine kinase (EC 2.7.1.39)
1206 :     310 gi|10186209 310 4e-178 99.68 1 310 -21 288 homoserine kinase
1207 :     310 gi|33383857 310 4e-178 99.68 1 310 -18 291 homoserine kinase
1208 :     310 pirnr|NF00706256 310 4e-178 99.68 1 310 -21 288 Homoserine kinase (Fragment)
1209 :     310 pirnr|NF01368414 310 4e-178 99.68 1 310 -18 291 homoserine kinase
1210 :     310 tr|Q9ETA5 310 4e-178 99.68 1 310 -21 288 Homoserine kinase
1211 :     310 sp|Q8XA82 310 9e-178 99.35 1 310 1 310 Homoserine kinase (EC 2.7.1.39)
1212 :     310 fig|155864.1.peg.3 310 9e-178 99.35 1 310 1 310 Homoserine kinase (EC 2.7.1.39)
1213 :     310 kegg|ece:Z0003 310 9e-178 99.35 1 310 1 310 homoserine kinase [EC:2.7.1.39]
1214 :     </pre>
1215 :     The first seven lines of this expanded set of similarities amount to
1216 :     identical matches against the set of sequences that were collapsed
1217 :     into a single entry in the nonredundant database (i.e., sequences that
1218 :     were essentially identical with <i>fig|83333.1.peg.3</i>). The next
1219 :     eight similarities correspond to the first "raw" similarity, and so
1220 :     forth.
1221 :     <br>
1222 :     Now let us examine the program:
1223 :     <ul>
1224 :     <li>Use the <i>Sim.pm</i> module, if you are processing similarities.
1225 :     That is, include
1226 :     <pre>
1227 :     use Sim;
1228 :     </pre>
1229 :     at the head of your program.
1230 :    
1231 :     <li>
1232 :     The invocation
1233 :     <pre>
1234 :     $fig->sims($peg,$max,$cutoff,"all",$expand)
1235 :     </pre>
1236 :     takes several parameters (the given PEG, the maximum number of
1237 :     similarities to return, the p-score cutoff, a "selection" option, and
1238 :     the number of raw similarities to expand).
1239 :     Itb returns a list of objects of type <i>similarity</i>.
1240 :     The "selection" option can
1241 :     be
1242 :     <ul>
1243 :     <li><i>raw</i> to get just raw similarities (equivalent to expanding
1244 :     0)
1245 :     <li><i>all</i> returns all similarities,
1246 :     <li><i>fig</i> returns only similarities against FIG ids,
1247 :     <li><i>external</i> returns only similarities against ids that are not
1248 :     FIG ids,
1249 :     </ul>
1250 :     To understand thye impact of the parameters, I suggest that you take
1251 :     the simple program and run it with altered parameters, or just use the
1252 :     SEED web services to see what happens.
1253 :    
1254 :    
1255 :     <li>
1256 :     To extract values relating to the similarity, you would use
1257 :     assignments as shown:
1258 :     <pre>
1259 :     $ln1 = $sim->ln1;
1260 :     $id2 = $sim->id2;
1261 :     $ln2 = $sim->ln2;
1262 :     $pscore = $sim->psc;
1263 :     $iden = $sim->iden;
1264 :     $b1 = $sim->b1;
1265 :     $e1 = $sim->e1;
1266 :     $b2 = $sim->b2;
1267 :     $e2 = $sim->e2;
1268 :     </pre>
1269 :     </ul>
1270 :    
1271 :     There is a great deal more that could be said about similarities, but
1272 :     for the purposes of this tutorial, that is it.
1273 :    
1274 :     <h2>Functional Coupling</h2>
1275 :    
1276 :     When I say that two genes are "functionally coupled", I mean that they
1277 :     play closely related roles. They may be multiple subunits of the same
1278 :     complex, they may be enzymes from the same pathway, they may be
1279 :     co-regulated, or whatever. Functional coupling can be inferred using
1280 :     many different technologies:
1281 :     <ul>
1282 :     <li>detection of preserved contiguity on chromosomes,
1283 :     <li>analysis of gene fusion event,
1284 :     <li>detection of similar occurrence profiles,
1285 :     <li>creation of protein-protein interaction data,
1286 :     <li>detection of common transcription regulatory sites, and
1287 :     <li>detection of common expression profiles
1288 :     </ul>
1289 :     While the SEED will eventually support all of these techniques (and,
1290 :     now does support the first three), support for detection of preserved
1291 :     contiguity is the only one that is built into the kernel.
1292 :     Hence, this discussion will focus on detection of functional coupling
1293 :     via identification of preserved contiguity of genes.
1294 :     <p>
1295 :     The SEED supports one central role that can be used to summarize
1296 :     preserved contiguity:
1297 :     The following short program illustrates its use:
1298 :    
1299 : olson 1.2 <p>
1300 :     <blockquote>
1301 :     <table border="1">
1302 :     <tr>
1303 :     <td>show_functional_coupling.pl</td>
1304 :     <td>show_functional_coupling.py</td>
1305 :     </tr>
1306 :     <tr>
1307 :     <td bgcolor="f0f0f0"><pre>
1308 : overbeek 1.1 use FIG;
1309 :     my $fig = new FIG;
1310 :     use strict;
1311 :    
1312 :     my($usage,$bound,$sim_cutoff,$fc_cutoff,@fc,$tuple,$peg);
1313 :    
1314 : olson 1.2 $usage = "usage: show_functional_coupling Bound SimCutoff FcCutOff < PEGs > PEG1-Sc-PEGs";
1315 : overbeek 1.1
1316 :     (
1317 :     ($bound = shift @ARGV) &&
1318 :     ($sim_cutoff = shift @ARGV) &&
1319 :     ($fc_cutoff = shift @ARGV)
1320 :     )
1321 :     || die $usage;
1322 :    
1323 :     while (defined($_ = <STDIN>))
1324 :     {
1325 :     if ($_ =~ /^(fig\|\d+\.\d+\.peg\.\d+)/)
1326 :     {
1327 : olson 1.2 $peg = $1;
1328 :     @fc = $fig->coupling_and_evidence($peg,$bound,$sim_cutoff,$fc_cutoff);
1329 :     foreach $tuple (@fc)
1330 :     {
1331 :     # print &Dumper($tuple);
1332 :     print "$peg\t$tuple->[0]\t$tuple->[1]\n";
1333 :     }
1334 : overbeek 1.1 }
1335 :     }
1336 : olson 1.2 </pre></td>
1337 :     <td bgcolor="f0f0f0"><pre>
1338 :     from FigKernelPackages import FIG2
1339 :     import sys, re
1340 :    
1341 :     fig = FIG2.FIG()
1342 :    
1343 :     def die(msg):
1344 :     print msg
1345 :     sys.exit(0)
1346 :    
1347 :     if len(sys.argv) != 4:
1348 :     die("usage: fc Bound SimCutoff FcCutOff < PEGs > PEG1-Sc-PEGs")
1349 :    
1350 :    
1351 :     bound = sys.argv[1]
1352 :     sim_cutoff = sys.argv[2]
1353 :     fc_cutoff = sys.argv[3]
1354 :    
1355 :     for line in sys.stdin.readlines():
1356 :     match = re.search("^(fig\|\d+\.\d+\.peg\.\d+)", line)
1357 :     if match:
1358 :     peg = match.group(1)
1359 :     fc = fig.coupling_and_evidence(peg, bound, sim_cutoff, fc_cutoff)
1360 :     for tuple in fc:
1361 :     print "%s\t%s\t%s" % ( peg, tuple[0], tuple[1])
1362 :     </pre></td>
1363 :     </tr>
1364 :     </table>
1365 :     </blockquote>
1366 : overbeek 1.1
1367 :     <h3>How to compute functional coupling scores based on preserved contiguity</h3>
1368 :     <hr>
1369 :     The short program simply reads in a file containing PEG ids. For each
1370 :     PEG, it invokes
1371 :     <pre>
1372 :     @fc =$fig->coupling_and_evidence($peg,$bound,$sim_cutoff,$fc_cutoff);
1373 :     </pre>
1374 :    
1375 :     to compute a list of 3-tuples. Each 3-tuple corresponds to a
1376 :     functionally coupled gene. The 3-tuple contains
1377 :     <ol>
1378 :     <li> the number of occurrences of preservation of homologous pairs
1379 :     that were detected in somewhat diverged genomes,
1380 :     <li> the PEG id of a functionally coupled gene, and
1381 :     <li> a pointer to a list of 2-tuples that describe the pairs in
1382 :     different genomes.
1383 :     </ol>
1384 :     If the program is run using the following command
1385 :     <pre>
1386 :     show_functional_coupling 5000 1.0e-10 2 < tmp
1387 :     </pre>
1388 :     and if <i>tmp</i> contains a single line of the form
1389 :     <pre>
1390 :     fig|83333.1.peg.3
1391 :     </pre>
1392 :     then the program will produce the following output:
1393 :     <pre>
1394 :     $VAR1 = [
1395 :     77,
1396 :     'fig|83333.1.peg.2',
1397 :     [
1398 :     [
1399 :     'fig|263.1.peg.1622',
1400 :     'fig|263.1.peg.1624'
1401 :     ],
1402 :     [
1403 :     'fig|263.1.peg.1622',
1404 :     'fig|263.1.peg.1625'
1405 :     ],
1406 :     [
1407 :     'fig|562.2.peg.2720',
1408 :     'fig|562.2.peg.2721'
1409 :     ],
1410 :     [
1411 :     'fig|573.2.peg.4598',
1412 :     'fig|573.2.peg.4594'
1413 :     ],
1414 :     .
1415 :     . <<< deleted lines showing more pairs >>>
1416 :     .
1417 :     ]
1418 :     ];
1419 :     $VAR1 = [
1420 :     44,
1421 :     'fig|83333.1.peg.4',
1422 :     [
1423 :     [
1424 :     'fig|263.1.peg.1622',
1425 :     'fig|263.1.peg.1620'
1426 :     ],
1427 :     [
1428 :     'fig|562.2.peg.2526',
1429 :     'fig|562.2.peg.2527'
1430 :     ],
1431 :     [
1432 :     'fig|562.2.peg.2521',
1433 :     'fig|562.2.peg.2522'
1434 :     ],
1435 :     [
1436 :     'fig|562.2.peg.2525',
1437 :     'fig|562.2.peg.2527'
1438 :     ],
1439 :     [
1440 :     'fig|573.2.peg.4598',
1441 :     'fig|573.2.peg.4599'
1442 :     ],
1443 :     .
1444 :     . <<< deleted lines showing more pairs >>>
1445 :     .
1446 :     ]
1447 :     ];
1448 :     $VAR1 = [
1449 :     10,
1450 :     'fig|83333.1.peg.6',
1451 :     [
1452 :     [
1453 :     'fig|562.2.peg.2526',
1454 :     'fig|562.2.peg.2529'
1455 :     ],
1456 :     [
1457 :     'fig|562.2.peg.2525',
1458 :     'fig|562.2.peg.2529'
1459 :     ],
1460 :     .
1461 :     . <<< deleted lines showing more pairs >>>
1462 :     .
1463 :     ]
1464 :     ];
1465 :     $VAR1 = [
1466 :     7,
1467 :     'fig|83333.1.peg.7',
1468 :     [
1469 :     [
1470 :     'fig|562.2.peg.2526',
1471 :     'fig|562.2.peg.2531'
1472 :     ],
1473 :     [
1474 :     'fig|562.2.peg.2526',
1475 :     'fig|562.2.peg.2530'
1476 :     ],
1477 :     [
1478 :     'fig|562.2.peg.2525',
1479 :     'fig|562.2.peg.2531'
1480 :     ],
1481 :     [
1482 :     'fig|562.2.peg.2525',
1483 :     'fig|562.2.peg.2530'
1484 :     ],
1485 :     [
1486 :     'fig|573.2.peg.4598',
1487 :     'fig|573.2.peg.4604'
1488 :     ],
1489 :     .
1490 :     . <<< deleted lines showing more pairs >>>
1491 :     .
1492 :     ]
1493 :     ];
1494 :     $VAR1 = [
1495 :     7,
1496 :     'fig|83333.1.peg.8',
1497 :     [
1498 :     [
1499 :     'fig|594.1.peg.4954',
1500 :     'fig|594.1.peg.4958'
1501 :     ],
1502 :     [
1503 :     'fig|630.2.peg.581',
1504 :     'fig|630.2.peg.584'
1505 :     ],
1506 :     [
1507 :     'fig|12149.1.peg.2',
1508 :     'fig|12149.1.peg.6'
1509 :     ],
1510 :     .
1511 :     . <<< deleted lines showing more pairs >>>
1512 :     .
1513 :     ]
1514 :     ];
1515 :     </pre>
1516 :     If you simply replaced
1517 :     <pre>
1518 :     print &Dumper($tuple);
1519 :     </pre>
1520 :     with
1521 :     <pre>
1522 :     print "$peg\t$tuple->[0]\t$tuple->[1]\n";
1523 :     </pre>
1524 :     you would get
1525 :     <pre>
1526 :     fig|83333.1.peg.3 77 fig|83333.1.peg.2
1527 :     fig|83333.1.peg.3 44 fig|83333.1.peg.4
1528 :     fig|83333.1.peg.3 10 fig|83333.1.peg.6
1529 :     fig|83333.1.peg.3 7 fig|83333.1.peg.7
1530 :     fig|83333.1.peg.3 7 fig|83333.1.peg.8
1531 :    
1532 :     </pre>
1533 :     which might be closer to what most end users wish to see.
1534 :     <p>
1535 :     The only detail that might be worth noting is the scoring function.
1536 :     It counts the number of pairs (besides the pair composed of the given
1537 :     PEG and the related PEG) that could be located, where "essentially
1538 :     identical" occurrences are collapsed to a single occurrence. For
1539 :     example, if you had 10 <i>E.coli</i> genomes (of closely related
1540 :     strains), you would wish to count them as a single occurrence.
1541 :     Basically, we collapse occurrences between genes that are encode
1542 :     protein sequences that are more than 97% identical to a single
1543 :     occurrence. So, when you see
1544 :     <pre>
1545 :     fig|83333.1.peg.3 77 fig|83333.1.peg.2
1546 :     </pre>
1547 :     This means that there are genes encoding proteins similar to those
1548 :     encoded by fig|83333.1.peg.3 and fig|83333.1.peg.2 that are within
1549 :     5000 bases of one another in 77 quite distinct genomes.
1550 :    
1551 :     <h2>Maps, Subsystems, Functional Roles, and gene Functions</h2>
1552 :    
1553 :     The SEED currently supports four distinct notions that I attempt to
1554 :     differentiate in ths section. My working definitions would be as
1555 :     follows:
1556 :     <ol>
1557 :     <li>
1558 :     A <b>map</b> is a relatively large set of related functional roles.
1559 :     The notion grew out of working with KEGG maps. It encompasses larger
1560 :     functional units than the related notion <b>pathway</b>.
1561 :     <li>The notion of <b>subsystem</b> is a slight extension of the
1562 :     concept <b>pathway</b>. It is a set of functional roles that together
1563 :     play a related function. A pathway is a metabolic subsystem. A
1564 :     ribosome is a nonmetabolic subsystem.
1565 :     <p>
1566 :     Since KEGG is doing such a useful job in attempting to curate maps, we
1567 :     download their maps and use them. We supplement them occasionally
1568 :     with some drawn by ourselves, but we do think that the community owes
1569 :     KEGG real gratitude for making these maps available.
1570 :    
1571 :     <li>My notion of <b>functional role</b> is roughly an abstraction of
1572 :     the function of a gene (i.e., a monfunctional gene). It is a
1573 :     well-defined component of a subsystem. Variants of subsystems should
1574 :     be thought of as sets of functional roles (i.e., not as sets of
1575 :     genes).
1576 :    
1577 :     <li>A <b>gene function</b> may be thought of as a set of functional
1578 :     roles. Often there is just one, but many genes are multifunctional.
1579 :     Somestimes this means that a single domain of the protein encoded by
1580 :     the gene has low specificity which allows the gene to play multiple
1581 :     roles, and sometimes the gene has several domains with each domain
1582 :     implementing a complete functional role. Either way, if a gene
1583 :     function implements multiple roles, it should have the form
1584 :     <pre>
1585 :     role1 / role2 / role3...
1586 :     </pre>
1587 :     That is, the gene function should be made up of a list of roles
1588 :     separated by " / ".
1589 :     A gene function is said to <i>connect</i> to each of the roles that
1590 :     make it up (those roles that are separated by " / "), as well as any
1591 :     embedded EC numbers. Thus, the gene function
1592 :     <pre>
1593 :     Pyruvate kinase (EC 2.7.1.40) / 6-phosphofructokinase (EC 2.7.1.11)
1594 :     </pre>
1595 :     would connect to
1596 :     <ol>
1597 :     <li>Pyruvate kinase (EC 2.7.1.40),
1598 :     <li>6-phosphofructokinase (EC 2.7.1.11),
1599 :     <li>2.7.1.40, and
1600 :     <li>2.7.1.11
1601 :     </ol>
1602 :     </ol>
1603 :     <p>
1604 :     Normally, when we speak of maps, most of the functional roles are
1605 :     simply EC numbers (although some are not). When we speak of
1606 :     subsystems, the functional roles are almost never specified as just an
1607 :     EC number.
1608 :     <p>
1609 :     Our first simple example program takes as command line arguments a map
1610 :     ID (i.e., one of the KEGG map numbers -- these can be taken off the
1611 :     initial page displayed by the SEED web services) and a genome ID. It
1612 :     displays the map ID and the "map name". Then, it lists the roles
1613 :     contained within the map, and for each role it displays the PEGs
1614 :     associated with the role. Thus,
1615 :    
1616 :     <pre>
1617 :     connect_map_to_genes MAP00010 83333.1
1618 :     </pre>
1619 :     produced the following output for me:
1620 :     <pre>
1621 :     MAP00010 Glycolysis / Gluconeogenesis
1622 :     2.7.1.11
1623 :     fig|83333.1.peg.1706 6-phosphofructokinase II (EC 2.7.1.11)
1624 :     fig|83333.1.peg.3835 6-phosphofructokinase (EC 2.7.1.11)
1625 :     2.7.1.41
1626 :     1.2.1.3
1627 :     fig|83333.1.peg.3522 Aldehyde dehydrogenase (EC 1.2.1.3)
1628 :     5.4.2.1
1629 :     fig|83333.1.peg.3548 2,3-bisphosphoglycerate-independent phosphoglycerate mutase (EC 5.4.2.1)
1630 :     fig|83333.1.peg.4303 Phosphoglycerate mutase (EC 5.4.2.1)
1631 :     fig|83333.1.peg.741 Phosphoglycerate mutase (EC 5.4.2.1)
1632 :     3.1.3.10
1633 :     fig|83333.1.peg.989 Glucose-1-phosphatase precursor (EC 3.1.3.10)
1634 :     1.1.1.27
1635 :     fig|83333.1.peg.787 L-lactate dehydrogenase (EC 1.1.1.27)
1636 :     1.8.1.4
1637 :     fig|83333.1.peg.116 Dihydrolipoamide dehydrogenase (EC 1.8.1.4)
1638 :     3.6.1.7
1639 :     fig|83333.1.peg.953 Acylphosphatase (EC 3.6.1.7)
1640 :     2.7.1.40
1641 :     fig|83333.1.peg.1660 Pyruvate kinase (EC 2.7.1.40)
1642 :     fig|83333.1.peg.1838 Pyruvate kinase (EC 2.7.1.40)
1643 :     2.7.1.1
1644 :     3.1.3.9
1645 :     4.1.1.1
1646 :     5.1.3.15
1647 :     1.1.99.8
1648 :     1.2.4.1
1649 :     fig|83333.1.peg.114 Pyruvate dehydrogenase E1 component (EC 1.2.4.1)
1650 :     5.1.3.3
1651 :     fig|83333.1.peg.742 Aldose 1-epimerase (EC 5.1.3.3)
1652 :     6.2.1.1
1653 :     fig|83333.1.peg.3979 Acetyl-coenzyme A synthetase (EC 6.2.1.1)
1654 :     5.4.2.4
1655 :     2.7.1.69
1656 :     fig|83333.1.peg.1086 PTS system, glucose-specific IIBC component (EIIBC-Glc) (Glucose- permease IIBC component) (Phosphotransferase enzyme II, BC component) (EC 2.7.1.69)
1657 :     fig|83333.1.peg.1607 PTS system, maltose and glucose-specific IIABC component (Maltose and glucose-permease IIABC component) (Phosphotransferase enzyme II, ABC component) (EC 2.7.1.69)
1658 :     fig|83333.1.peg.1719 PTS system, cellobiose-specific IIA component (EIIA-Cel) (Cellobiose- permease IIA component) (Phosphotransferase enzyme II, A component) (EC 2.7.1.69)
1659 :     fig|83333.1.peg.1721 PTS system, cellobiose-specific IIB component (EIIB-Cel) (Cellobiose- permease IIB component) (Phosphotransferase enzyme II, B component) (EC 2.7.1.69)
1660 :     fig|83333.1.peg.1800 PTS system, mannose-specific IIAB component (EC 2.7.1.69)
1661 :     fig|83333.1.peg.2068 PTS system, galactitol-specific IIB component (EIIB-GAT) (Galacticol- permease IIB component) (Phosphotransferase enzyme II, B component) (EC 2.7.1.69)
1662 :     fig|83333.1.peg.2069 PTS system, galactitol-specific IIA component (EIIA-GAT) (Galacticol- permease IIA component) (Phosphotransferase enzyme II, A component) (EC 2.7.1.69)
1663 :     fig|83333.1.peg.2142 PTS system, fructose-specific IIBC component (EIIBC-Fru) (Fructose- permease IIBC component) (Phosphotransferase enzyme II, BC component) (EC 2.7.1.69)
1664 :     fig|83333.1.peg.2144 PTS system, fructose-specific IIA/FPr component (EIIA-Fru) (Fructose- permease IIA/FPr component) (Phosphotransferase enzyme II, A/FPr component) (EC 2.7.1.69)
1665 :     fig|83333.1.peg.2165 PTS system, mannitol-specific IIBC component (EC 2.7.1.69)
1666 :     fig|83333.1.peg.2360 PTS system, fructose-specific IIBC component (EC 2.7.1.69)
1667 :     fig|83333.1.peg.2385 PTS system, glucose-specific IIA component (EC 2.7.1.69)
1668 :     fig|83333.1.peg.2657 Sorbitol PTS, EIIC (EC 2.7.1.69)
1669 :     fig|83333.1.peg.2658 PTS system, glucitol/sorbitol-specific IIBC component (EC 2.7.1.69)
1670 :     fig|83333.1.peg.2659 PTS system, glucitol/sorbitol-specific IIA component (EIIA-gut) (Glucitol/sorbitol-permease IIA component) (Phosphotransferase enzyme II, A component) (EC 2.7.1.69)
1671 :     fig|83333.1.peg.2670 PTS system, arbutin-, cellobiose-, and salicin-specific IIABC component (EIIABC-ASC) (Arbutin-, cellobiose-, and salicin-permease IIABC component) (Phosphotransferase enzyme II, ABC component) (EC 2.7.1.69)
1672 :     fig|83333.1.peg.2884 PTS system, mannitol (Cryptic) -specific IIBC component (EIIBC-(C)MTL) (Mannitol (Cryptic)-permease IIBC component) (Phosphotransferase enzyme II, BC component) (EC 2.7.1.69)
1673 :     fig|83333.1.peg.2885 PTS system, mannitol (Cryptic) -specific IIA component (EIIA-(C)MTL) (Mannitol (Cryptic)-permease IIA component) (Phosphotransferase enzyme II, A component) (EC 2.7.1.69)
1674 :     fig|83333.1.peg.3083 PTS system, N-acetylgalactosamine-specific IIB component 1 (EIIB-AGA) (N-acetylgalactosamine-permease IIB component 1) (Phosphotransferase enzyme II, B component 1) (EC 2.7.1.69)
1675 :     fig|83333.1.peg.3147 Nitrogen regulatory IIA protein (EC 2.7.1.69)
1676 :     fig|83333.1.peg.3534 PTS system, mannitol-specific IIABC component (EIIABC-MTL) (Mannitol- permease IIABC component) (Phosphotransferase enzyme II, ABC component) (EC 2.7.1.69)
1677 :     fig|83333.1.peg.3659 PTS system, beta-glucoside-specific IIABC component (EIIABC-Bgl) (Beta-glucoside-permease IIABC component) (Phosphotransferase enzyme II, ABC component) (EC 2.7.1.69)
1678 :     fig|83333.1.peg.3820 PTS system, fructose-specific IIBC component (EC 2.7.1.69)
1679 :     fig|83333.1.peg.3821 PTS system, fructose-specific IIABC component (EC 2.7.1.69)
1680 :     fig|83333.1.peg.3869 PTS system, fructose-specific IIBC component (EC 2.7.1.69)
1681 :     fig|83333.1.peg.3870 PTS system, fructose-like-2 IIB component 1 (Phosphotransferase enzyme II, B component) (EC 2.7.1.69)
1682 :     fig|83333.1.peg.3873 PTS system, fructose-like-2 IIB component 2 (Phosphotransferase enzyme II, B component) (EC 2.7.1.69)
1683 :     fig|83333.1.peg.4150 PTS system, trehalose-specific IIBC component (EIIBC-TRE) (Trehalose- permease IIBC component) (Phosphotransferase enzyme II, BC component) (EC 2.7.1.69)
1684 :     fig|83333.1.peg.4213 PTS system, galactitol-specific IIC component (EC 2.7.1.69)
1685 :     fig|83333.1.peg.669 PTS system, glucose-specific IIABC component (EC 2.7.1.69)
1686 :     fig|83333.1.peg.723 PTS system, fructose-specific IIABC component (EC 2.7.1.69)
1687 :     1.1.1.71
1688 :     1.1.1.1
1689 :     fig|83333.1.peg.1228 Aldehyde-alcohol dehydrogenase [Includes: Alcohol dehydrogenase (EC 1.1.1.1) (ADH); Acetaldehyde dehydrogenase [acetylating] (EC 1.2.1.10) (ACDH); Pyruvate-formate-lyase deactivase (PFL deactivase)]
1690 :     fig|83333.1.peg.1301 Alcohol dehydrogenase (EC 1.1.1.1)
1691 :     fig|83333.1.peg.3523 Alcohol dehydrogenase (EC 1.1.1.1)
1692 :     fig|83333.1.peg.353 Alcohol dehydrogenase class III (EC 1.1.1.1)
1693 :     2.7.1.63
1694 :     3.2.1.86
1695 :     fig|83333.1.peg.1717 6-phospho-beta-glucosidase (EC 3.2.1.86)
1696 :     fig|83333.1.peg.2853 6-phospho-beta-glucosidase bglA (EC 3.2.1.86)
1697 :     fig|83333.1.peg.3658 6-phospho-beta-glucosidase bglB (EC 3.2.1.86)
1698 :     3.1.6.3
1699 :     1.2.1.12
1700 :     fig|83333.1.peg.1762 Glyceraldehyde 3-phosphate dehydrogenase (EC 1.2.1.12)
1701 :     2.7.2.3
1702 :     fig|83333.1.peg.2877 Phosphoglycerate kinase (EC 2.7.2.3)
1703 :     5.3.1.9
1704 :     fig|83333.1.peg.3934 Glucose-6-phosphate isomerase (EC 5.3.1.9)
1705 :     1.2.1.5
1706 :     1.2.1.51
1707 :     4.2.1.11
1708 :     fig|83333.1.peg.2735 Enolase (EC 4.2.1.11)
1709 :     5.3.1.1
1710 :     fig|83333.1.peg.3838 Triosephosphate isomerase (EC 5.3.1.1)
1711 :     3.1.3.13
1712 :     4.1.2.13
1713 :     fig|83333.1.peg.1504 Fructose-bisphosphate aldolase (EC 4.1.2.13)
1714 :     fig|83333.1.peg.2072 Fructose-bisphosphate aldolase (EC 4.1.2.13)
1715 :     fig|83333.1.peg.2876 Fructose-bisphosphate aldolase (EC 4.1.2.13)
1716 :     2.7.1.2
1717 :     fig|83333.1.peg.2361 Glucokinase (EC 2.7.1.2)
1718 :     2.3.1.12
1719 :     fig|83333.1.peg.115 Dihydrolipoamide acetyltransferase component of pyruvate dehydrogenase complex (EC 2.3.1.12) (E2)
1720 :     1.1.1.2
1721 :     3.1.3.11
1722 :     fig|83333.1.peg.2881 Fructose-1,6-bisphosphatase (EC 3.1.3.11)
1723 :     fig|83333.1.peg.4142 Fructose-1,6-bisphosphatase (EC 3.1.3.11)
1724 :     5.4.2.2
1725 :     fig|83333.1.peg.678 Phosphoglucomutase (EC 5.4.2.2)
1726 :     </pre>
1727 :    
1728 :     Now let's examine the short program that produced the output:
1729 : olson 1.2
1730 :     <p>
1731 :     <blockquote>
1732 :     <table border="1">
1733 :     <tr>
1734 :     <td>connect_map_to_genes.pl</td>
1735 :     <td>connect_map_to_genes.py</td>
1736 :     </tr>
1737 :     <tr>
1738 :     <td bgcolor="f0f0f0"><pre>
1739 : overbeek 1.1 use FIG;
1740 :     my $fig = new FIG;
1741 :     use Sim;
1742 :    
1743 :     use strict;
1744 :     my($usage,$map,$genome,$name,$role,$peg,$func,$expanded_role);
1745 :    
1746 :     $usage = "usage: connect_map_to_genes Map Genome";
1747 :    
1748 :     (
1749 :     ($map = shift @ARGV) &&
1750 :     ($genome = shift @ARGV)
1751 :     )
1752 :     || die $usage;
1753 :    
1754 :     $name = $fig->map_name($map);
1755 :     print "$map\t$name\n";
1756 : olson 1.2 foreach $role ($fig->map_to_ecs($map))
1757 : overbeek 1.1 {
1758 :     print " $role\n";
1759 :     foreach $peg ($fig->seqs_with_role($role,"master",$genome))
1760 :     {
1761 : olson 1.2 $func = $fig->function_of($peg);
1762 :     print " $peg\t$func\n";
1763 : overbeek 1.1 }
1764 :     }
1765 : olson 1.2 </pre></td>
1766 :     <td bgcolor="f0f0f0"><pre>
1767 :     from FigKernelPackages import FIG2
1768 :     import sys
1769 :    
1770 :     fig = FIG2.FIG()
1771 :    
1772 :     def die(msg):
1773 :     print msg
1774 :     sys.exit(0)
1775 :    
1776 :     if len(sys.argv) != 3:
1777 :     die("usage: connect_map_to_genes Map Genome")
1778 :    
1779 :     map = sys.argv[1]
1780 :     genome = sys.argv[2]
1781 :     name = fig.map_name(map);
1782 :    
1783 :     print "%s\t%s" % (map, name)
1784 :     roles = fig.map_to_ecs(map)
1785 :     for role in roles:
1786 :     print role
1787 :     pegs = fig.seqs_with_role(role,"master",genome)
1788 :     for peg in pegs:
1789 :     func = fig.function_of(peg);
1790 :     print " %s\t%s" % (peg, func[0][1])
1791 :     </pre></td>
1792 :     </tr>
1793 :     </table>
1794 :     </blockquote>
1795 : overbeek 1.1
1796 :     returns the full name associated with the map.
1797 :     <li>You can get the the list of functional roles contained within a map using
1798 :     <pre>
1799 :     $fig->map_to_roles($map)
1800 :     </pre>
1801 :     <li>Because the roles within a map are often just EC numbers, you will
1802 :     usually wish to "expand" them by attaching a common name for the
1803 :     function represented by the EC number. You can do this using
1804 :     <pre>
1805 :     $expanded_role = $fig->expand_ec($role)
1806 :     </pre>
1807 :     I did not do that in this example, since I wanted you to see the EC
1808 :     numbers being used as functional roles in maps. You should modify the
1809 :     program to expand the ECs and rerun it.
1810 :     <li>
1811 :     Finally, you can get a list of the PEGs within a genome that have gene functions
1812 :     which connect to a specific functional role using
1813 :     <pre>
1814 :     $fig->seqs_with_role($role,"master",$genome)
1815 :     </pre>
1816 :     </ul>
1817 :     Since maps and subsystems are remarkably similar notions (both are
1818 :     sets of functional roles), you could also construct a similar program
1819 :     for subsystems. In the SEED, subsystems do not have IDs which are
1820 :     associated with full names; rather, the full name is itself the ID of
1821 :     the subsystem. With this in mind, the following program should be
1822 :     reasonably clear:
1823 :    
1824 : olson 1.2 <p>
1825 :     <blockquote>
1826 :     <table border="1">
1827 :     <tr>
1828 :     <td>connect_subsystem_to_genes.pl</td>
1829 :     <td>connect_subsystem_to_genes.py</td>
1830 :     </tr>
1831 :     <tr>
1832 :     <td bgcolor="f0f0f0"><pre>
1833 :    
1834 : overbeek 1.1 use FIG;
1835 :     my $fig = new FIG;
1836 :     use Sim;
1837 :    
1838 :     use strict;
1839 :     my($usage,$subsystem,$genome,$name,$role,$peg,$func);
1840 :    
1841 :     $usage = "usage: connect_subsystem_to_genes Subsystem Genome";
1842 :    
1843 :     (
1844 :     ($subsystem = shift @ARGV) &&
1845 :     ($genome = shift @ARGV)
1846 :     )
1847 :     || die $usage;
1848 :    
1849 :     print "$subsystem\n";
1850 :     foreach $role ($fig->subsystem_to_roles($subsystem))
1851 :     {
1852 :     print " $role\n";
1853 :     foreach $peg ($fig->seqs_with_role($role,"master",$genome))
1854 :     {
1855 : olson 1.2 $func = $fig->function_of($peg);
1856 :     print " $peg\t$func\n";
1857 : overbeek 1.1 }
1858 :     }
1859 : olson 1.2 </pre></td>
1860 :     <td bgcolor="f0f0f0"><pre>
1861 :     from FigKernelPackages import FIG2
1862 :     import sys
1863 :    
1864 :     fig = FIG2.FIG()
1865 :    
1866 :     def die(msg):
1867 :     print msg
1868 :     sys.exit(0)
1869 :    
1870 :     if len(sys.argv) != 3:
1871 :     die("usage: connect_subsystem_to_genes Subsystem Genome")
1872 :    
1873 :     subsystem = sys.argv[1]
1874 :     genome = sys.argv[2]
1875 :    
1876 :     print subsystem
1877 :     roles = fig.subsystem_to_roles(subsystem)
1878 :     for role in roles:
1879 :     print role
1880 :     pegs = fig.seqs_with_role(role,"master",genome)
1881 :     for peg in pegs:
1882 :     func = fig.function_of(peg);
1883 :     print "%s\t%s" % ( peg, func[0][1])
1884 :     </pre></td>
1885 :     </tr>
1886 :     </table>
1887 :     </blockquote>
1888 : overbeek 1.1
1889 :     <h3>Locating genes within an organism that connect to a specific subsystem</h3>
1890 :     <hr>
1891 :     Invoking this program using
1892 :     <pre>
1893 :     connect_subsystem_to_genes 'Embden-Meyerhof and Gluconogenesis' 83333.1
1894 :     </pre>
1895 :     would produce the following output:
1896 :     <pre>
1897 :     </pre>
1898 :     Note that the functional roles in subsystems tend to be far more
1899 :     detailed (and almost never just an EC number).
1900 :     <p>
1901 :     The problem with the above program is that we have chosen to view a
1902 :     subsystem as a spreadsheet in which each cell corresponds to a
1903 :     specific functional role in a designated organism. The gene functions
1904 :     of the PEGs contained in the cell usually connect to the specific
1905 :     role, but this is not forced. If someone changes the function of a
1906 :     gene, it does not disappear from the spreadsheet, unless the
1907 :     spreadsheet is rebuilt by forcing the entries to reflect current
1908 :     assignments and how they connect to the roles. We have chosen this
1909 :     approach to allow curators of spreadsheets to have precise control
1910 :     over the contents. People who import a subsystem may or may not
1911 :     instal the gene assignments needed to make the spreadsheet
1912 :     consistent, but the spreadsheet reflects the curators best assessment
1913 :     of which genes implement specific roles. This leads to my assertion
1914 :     that the following program is a better way to display a subsystem:
1915 :    
1916 : olson 1.2 <p>
1917 :     <blockquote>
1918 :     <table border="1">
1919 :     <tr>
1920 :     <td>connect_subsystem_to_genes_the_right_way.pl</td>
1921 :     <td>connect_subsystem_to_genes_the_right_way.py</td>
1922 :     </tr>
1923 :     <tr>
1924 :     <td bgcolor="f0f0f0"><pre>
1925 :    
1926 : overbeek 1.1 use FIG;
1927 :     my $fig = new FIG;
1928 :     use Sim;
1929 :    
1930 :     use strict;
1931 :     my($usage,$subsystem,$genome,$name,$role,$peg,$func);
1932 :    
1933 :     $usage = "usage: connect_subsystem_to_genes_the_right_way Subsystem Genome";
1934 :    
1935 :     (
1936 : olson 1.2 ($subsystem = shift @ARGV) &&
1937 :     ($genome = shift @ARGV)
1938 :     )
1939 : overbeek 1.1 || die $usage;
1940 :    
1941 :     print "$subsystem\n";
1942 :     foreach $role ($fig->subsystem_to_roles($subsystem))
1943 :     {
1944 :     print " $role\n";
1945 : olson 1.2 foreach $peg ($fig->pegs_in_subsystem_cell($subsystem, $genome, $role))
1946 : overbeek 1.1 {
1947 : olson 1.2 $func = $fig->function_of($peg);
1948 :     print " $peg\t$func\n";
1949 : overbeek 1.1 }
1950 :     }
1951 : olson 1.2 </pre></td>
1952 :     <td bgcolor="f0f0f0"><pre>
1953 :     from FigKernelPackages import FIG2
1954 :     import sys
1955 :    
1956 :     fig = FIG2.FIG()
1957 :    
1958 :     def die(msg):
1959 :     print msg
1960 :     sys.exit(0)
1961 :    
1962 :     if len(sys.argv) != 3:
1963 :     die("usage: connect_subsystem_to_genes_the_right_way Subsystem Genome")
1964 :    
1965 :     subsystem = sys.argv[1]
1966 :     genome = sys.argv[2]
1967 :    
1968 :     print subsystem
1969 :     roles = fig.subsystem_to_roles(subsystem)
1970 :    
1971 :     for role in roles:
1972 :     print role
1973 :     pegs = fig.pegs_in_subsystem_cell(subsystem, genome, role)
1974 :     for peg in pegs:
1975 :     func = fig.function_of(peg)
1976 :     print " %s\t%s" % (peg, func[0][1])
1977 :     </pre></td>
1978 :     </tr>
1979 :     </table>
1980 :     </blockquote>
1981 : overbeek 1.1
1982 :     <h3>Locating genes within an organism that connect to a specific
1983 :     subsystem the right way</h3>
1984 :     <hr>
1985 :    
1986 :     <p>
1987 :     We have just illustrated how to connect maps to functional roles and
1988 :     functional roles to gene functions. Now we need to go the other way.
1989 :     That is, given a gene function, what roles would it
1990 :     connect to, and which maps and subsystems contain those roles?
1991 :     The following program illustrates the services required to answer this
1992 :     question;
1993 :    
1994 : olson 1.2 <p>
1995 :     <blockquote>
1996 :     <table border="1">
1997 :     <tr>
1998 :     <td>gf2maps_and_subs.pl</td>
1999 :     <td>gf2maps_and_subs.py</td>
2000 :     </tr>
2001 :     <tr>
2002 :     <td bgcolor="f0f0f0"><pre>
2003 : overbeek 1.1 use FIG;
2004 :     my $fig = new FIG;
2005 :    
2006 :     use strict;
2007 :     my($usage,$peg,$func,@roles,$role,@maps,$map);
2008 :     my(%in_map,@maps_it_connects_to,@subsystems_it_is_in);
2009 :     my($map_name,$subsystem);
2010 :    
2011 :     $usage = "usage: gf2maps_and_subs PEG";
2012 :    
2013 :     # The single input argument must be a FIG id, which is used to get the assigned function.
2014 :     # The program then prints a list of maps to which the gene can be connected,
2015 :     # followed by a list of subsystems that contain the gene.
2016 :    
2017 :     ($peg = shift @ARGV)
2018 :     || die $usage;
2019 :    
2020 :     if (($func = $fig->function_of($peg)) && (! &FIG::hypo($func)))
2021 :     {
2022 :     @roles = $fig->roles_of_function($func);
2023 :     foreach $role (@roles)
2024 :     {
2025 : olson 1.2 @maps = $fig->role_to_maps($role);
2026 :     foreach $map (@maps)
2027 :     {
2028 :     $in_map{$map} = 1;
2029 :     }
2030 : overbeek 1.1 }
2031 :    
2032 : olson 1.2 @maps_it_connects_to = sort keys(%in_map);
2033 : overbeek 1.1
2034 :     if (@maps_it_connects_to > 0)
2035 :     {
2036 : olson 1.2 print "Maps that connect to $func\n\n";
2037 :     foreach $map (@maps_it_connects_to)
2038 :     {
2039 :     $map_name = $fig->map_name($map);
2040 :     print "$map\t$map_name\n";
2041 :     }
2042 :     print "\n";
2043 : overbeek 1.1 }
2044 :    
2045 :     }
2046 :     else
2047 :     {
2048 :     print STDERR "you need to give a valid PEG with a non-hypothetical function to connect to maps\n";
2049 :     }
2050 :    
2051 :     @subsystems_it_is_in = $fig->peg_to_subsystems($peg);
2052 :    
2053 :     if (@subsystems_it_is_in > 0)
2054 :     {
2055 :     print "Subsystems that contain $peg\n\n";
2056 :     foreach $subsystem (@subsystems_it_is_in)
2057 :     {
2058 : olson 1.2 print "$subsystem\n";
2059 : overbeek 1.1 }
2060 :     }
2061 : olson 1.2 </pre></td>
2062 :     <td bgcolor="f0f0f0"><pre>
2063 :     from FigKernelPackages import FIG2
2064 :     import sys
2065 :    
2066 :     fig = FIG2.FIG()
2067 :    
2068 :     def die(msg):
2069 :     print msg
2070 :     sys.exit(0)
2071 :    
2072 :     if len(sys.argv) != 2:
2073 :     die("usage: gf2maps_and_subs PEG")
2074 :    
2075 :     # The single input argument must be a FIG id, which is used to get the assigned function.
2076 :     # The program then prints a list of maps to which the gene can be connected,
2077 :     # followed by a list of subsystems that contain the gene.
2078 :    
2079 :     peg = sys.argv[1]
2080 :    
2081 :     func = fig.function_of(peg)
2082 :     func = func[0][1]
2083 :     hypo = fig.hypo(func)
2084 :     in_map = {}
2085 :    
2086 :     if func and hypo[0] == 0:
2087 :     roles = fig.roles_of_function(func)
2088 :     for role in roles:
2089 :     maps = fig.role_to_maps(role)
2090 :     for map in maps:
2091 :     in_map[map] = 1
2092 :    
2093 :     maps_it_connects_to = in_map.keys()
2094 :     maps_it_connects_to.sort()
2095 :    
2096 :     if len(maps_it_connects_to) > 0:
2097 :     print "Maps that connect to %s\n" % func
2098 :     for map in maps_it_connects_to:
2099 :     map_name = fig.map_name(map)
2100 :     print "%s\t%s" % (map, map_name[0])
2101 :     print " ";
2102 :     else:
2103 :     sys.stderr.write("you need to give a valid PEG with a non-hypothetical function to connect to maps\n")
2104 :    
2105 :     subsystems_it_is_in = fig.peg_to_subsystems(peg)
2106 :    
2107 :     if len(subsystems_it_is_in) > 0:
2108 :     print "Subsystems that contain %s\n" % peg
2109 :     for subsystem in subsystems_it_is_in:
2110 :     print subsystem
2111 :     </pre></td>
2112 :     </tr>
2113 :     </table>
2114 :     </blockquote>
2115 :    
2116 : overbeek 1.1 <h3>Going from a PEG to the maps and subsystems containing it</h3>
2117 :     <hr>
2118 :     Here, again, there are some of points worth discussing:
2119 :     <ul>
2120 :     <li> First, note that the input argument is a PEG, which we use to get
2121 :     the function that we will work with. You learned how to get the
2122 :     function in a previous example, but note the use of
2123 :     <pre>
2124 :     &FIG::hypo($func)
2125 :     </pre>
2126 :     to check to see whether or not the function is equivalent to
2127 :     <i>hypothetical protein</i>.
2128 :     <p>
2129 :     The way we explore connections between the PEG and maps is quite
2130 :     different than the way we connect the PEG to subsystems. In the case
2131 :     of maps, the connection is constructed by computing the set of roles
2132 :     to which the gene function connects, and then connecting each role to
2133 :     maps. In the case of subsystems, we simply look up which subsystems
2134 :     contain the given PEG. Subsystems explicitly contain a precise list
2135 :     of the PEGs that occur within them, which is not true of maps.
2136 :    
2137 :     <li>The key services for getting at the connections are as follows:
2138 :     <pre>
2139 :     @roles = $fig->roles_of_function($func);
2140 :     @maps = $fig->role_to_maps($role)
2141 :     @subsystems = $fig->peg_to_subsystems($peg);
2142 :     </pre>
2143 :     The first takes a function as input and returns a list of roles that
2144 :     the function connects to, the second takes a role and returns a list
2145 :     of maps to which the role connects, and the last takes a peg and
2146 :     returns the list of subsystems that contain the PEG.
2147 :     </ul>
2148 :     If you take a few minutes out and think about maps and subsystems, you
2149 :     will see that they are very, very similar concepts that are treated
2150 :     quite differently by the SEED.
2151 :    
2152 :     <h2>Annotations</h2>
2153 :    
2154 :     Annotations are time-stamped text notes made by a named user and
2155 :     attached to specific features. The most common are notes recording
2156 :     when someone made an assignment of function to a PEG, but they can be
2157 :     used to record anything. The usual way to access the annotations
2158 :     attached to a given feature with an ID designated by <i>$fid</i> is to
2159 :     use
2160 :     <pre>
2161 :     @annotations = $fig->feature_annotations($fid)
2162 :     </pre>
2163 :     The returned list contains 4-tuples (i.e., pointers to lists, each of
2164 :     which contains 4 entries). Each 4-tuple contains
2165 :     <ol>
2166 :     <li> the FID (this is redundant, but it was convenient for reasons
2167 :     that will become apparent),
2168 :     <li>a timestamp,
2169 :     <li>the name of the user that made the annotation, and
2170 :     <li>the annotation in the form of a text string.
2171 :     </ol>
2172 :     The following short program uses
2173 :     <pre>
2174 :     @annotations = $fig->merged_related_annotations($fig_ids)
2175 :     </pre>
2176 :     which is a useful generalization. This routine takes a pointer to a list of
2177 :     feature IDs as input, and it produces a merged list of the annotations
2178 :     attached to all of the features. This can often be very convenient.
2179 :     For example, you might have a set of similar PEGs that are all
2180 :     misannotated, and you wish to reconstruct the steps that led to the
2181 :     current situation. By merging the annotations (which normally record
2182 :     both who made the annotations and why), you can get a record of the
2183 :     events in chronilogical order. The following short example
2184 :     illustrates this utility:
2185 :    
2186 : olson 1.2 <p>
2187 :     <blockquote>
2188 :     <table border="1">
2189 :     <tr>
2190 :     <td>annotations_of.pl</td>
2191 :     <td>annotations_of.py</td>
2192 :     </tr>
2193 :     <tr>
2194 :     <td bgcolor="f0f0f0"><pre>
2195 : overbeek 1.1 use FIG;
2196 :     my $fig = new FIG;
2197 :     use strict;
2198 :    
2199 :     my($usage,$fig_ids,@annotations);
2200 :    
2201 :     $usage = "usage: annotations_of Fid1 Fid2 ...";
2202 :    
2203 :     (@ARGV > 0)
2204 :     || die $usage;
2205 :    
2206 :     $fig_ids = [@ARGV];
2207 :    
2208 :     @annotations = $fig->merged_related_annotations($fig_ids);
2209 :     if (@annotations > 0)
2210 :     {
2211 :     &display_annotations(\@annotations);
2212 :     }
2213 :     else
2214 :     {
2215 :     print STDERR "Sorry, there were no annotations attached to the features\n";
2216 :     }
2217 :    
2218 :     sub display_annotations {
2219 :     my($annotations) = @_;
2220 :     my($annotation,$fid,$timestamp,$user,$text);
2221 :    
2222 :     foreach $annotation (@$annotations)
2223 :     {
2224 : olson 1.2 ($fid,$timestamp,$user,$text) = @$annotation;
2225 :     print "=======\n";
2226 :     print "$user\t$timestamp\t$fid\n$text\n";
2227 :     print "=======\n\n";
2228 : overbeek 1.1 }
2229 :     }
2230 : olson 1.2 </pre></td>
2231 :     <td bgcolor="f0f0f0"><pre>
2232 :     from FigKernelPackages import FIG
2233 :     import sys
2234 :    
2235 :     def display_annotations(annotations):
2236 :     for annotation in annotations:
2237 :     fid,timestamp,user,text = annotation
2238 :     print "=======\n";
2239 :     print "%s\t%s\t%s\n%s\n" % ( user, timestamp, fid, text)
2240 :     print "=======\n\n"
2241 :    
2242 :     fig = FIG.FIG()
2243 :    
2244 :     def die(msg):
2245 :     print msg
2246 :     sys.exit(0)
2247 :    
2248 :     if len(sys.argv) < 2:
2249 :     die("usage: annotations_of Fid1, Fid2, ... ")
2250 :    
2251 :     fig_ids = sys.argv[1:]
2252 :    
2253 :     annotations = fig.merged_related_annotations(fig_ids)
2254 :     if len(annotations) > 0:
2255 :     display_annotations(annotations)
2256 :     else:
2257 :     sys.stderr.write("Sorry, there were no annotations attached to the features\n")
2258 :    
2259 :     </pre></td>
2260 :     </tr>
2261 :     </table>
2262 :     </blockquote>
2263 :    
2264 : overbeek 1.1
2265 :     <h3>Accessing the annotations from a set of features</h3>
2266 :     <hr>
2267 :     Invoking this little utility using
2268 :     <pre>
2269 :     annotations_of 'fig|207559.1.peg.735' 'fig|12149.1.peg.2133' 'fig|1525.1.peg.1757'
2270 :     </pre>
2271 :     produced the following output:
2272 :     <pre>
2273 :     =======
2274 :     automated_assignment Thu Mar 18 17:42:51 2004 fig|12149.1.peg.2133
2275 :     Set master function to
2276 :     ATP phosphoribosyltransferase (EC 2.4.2.1
2277 :     =======
2278 :    
2279 :     =======
2280 :     automated_assignment Thu Mar 18 17:45:16 2004 fig|1525.1.peg.1757
2281 :     Set master function to
2282 :     ATP phosphoribosyltransferase (EC 2.4.2.1
2283 :     =======
2284 :    
2285 :     =======
2286 :     last_release Fri Apr 2 14:14:07 2004 fig|207559.1.peg.735
2287 :     Set master function to
2288 :     ATP phosphoribosyltransferase (EC 2.4.2.1
2289 :     =======
2290 :     </pre>
2291 :    
2292 :     <h2>Assigning Gene Functions</h2>
2293 :    
2294 :     We can now easily talk about assigning functions and making
2295 :     annotations to record the events. Here is a simple utility
2296 :     that can be used to make annotations:
2297 :    
2298 : olson 1.2 <p>
2299 :     <blockquote>
2300 :     <table border="1">
2301 :     <tr>
2302 :     <td>batch_assignments.pl</td>
2303 :     </tr>
2304 :     <tr>
2305 :     <td bgcolor="f0f0f0"><pre>
2306 : overbeek 1.1 use FIG;
2307 :     my $fig = new FIG;
2308 :    
2309 :     use strict;
2310 :     my($usage,$user,$annotation,$text,$peg,$func,$conf,$user_name,$funcO);
2311 :    
2312 :     $usage = "usage: batch_assignments User [AnnotationFile] < Assignments";
2313 :    
2314 :     ($user = shift @ARGV)
2315 :     || die $usage;
2316 :    
2317 :     if ($annotation = shift @ARGV)
2318 :     {
2319 :     $text = join("",`cat $annotation`);
2320 :     }
2321 :     else
2322 :     {
2323 :     $text = "";
2324 :     }
2325 :    
2326 :     while (defined($_ = <STDIN>))
2327 :     {
2328 :     chop;
2329 :     ($peg,$func,$conf) = split(/\t/,$_);
2330 :     if (! $conf) { $conf = "" }
2331 :     if ($user =~ /master:(.*)/)
2332 :     {
2333 :     $user_name = $1;
2334 :     $funcO = $fig->function_of($peg);
2335 :     if ($funcO ne $func)
2336 :     {
2337 :     $fig->assign_function($peg,"master",$func,$conf);
2338 :     $fig->add_annotation($peg,$user_name,"Set master function to\n$func\n$text");
2339 :     }
2340 :     }
2341 :     else
2342 :     {
2343 :     $funcO = $fig->function_of($peg,$user);
2344 :     if ($funcO ne $func)
2345 :     {
2346 :     $fig->assign_function($peg,$user,$func,$conf);
2347 :     $fig->add_annotation($peg,$user,"Set function to\n$func\n$text");
2348 :     }
2349 :     }
2350 :     }
2351 :    
2352 : olson 1.2 </pre></td>
2353 :     </tr>
2354 :     </table>
2355 :     </blockquote>
2356 : overbeek 1.1
2357 :     <h3>A utility to make batch assignments</h3>
2358 :     <hr>
2359 :     Before discussing the details of this code, we should discuss the
2360 :     issue of "master and non-master users". Each gene can have a single
2361 :     <i>master assignment</i> and any number of <i>non-master
2362 :     assignments</i>. A user who has signed in as <i>master:USER</i>
2363 :     (e.g., "master:RossO") is a master user. If a master user makes an
2364 :     assignment, it overwrites the master assignment. If a non-master user
2365 :     makes an assignment, it alters only the assignment corresponding to
2366 :     that user, if there is one. When a master asks for the function of a
2367 :     gene, he gets the master assignment. If a non-master asks for the
2368 :     function of a gene, he will get the assignment corresponding to his
2369 :     user id, if there is one; if not, he will get the master assignment,
2370 :     if there is one.
2371 :     <p>
2372 :     There is also the matter of <i>confidence levels</i>. The SEED
2373 :     supports attaching confidence codes to assignments, although they are
2374 :     seldom used. The codes we use at this point are:
2375 :     <ul>
2376 :     <li> an empty string, a space or <b>N</B> means "normal or average confidence",
2377 :     <li> an <b>H</b> or an <b>S</B> means "high or strong confidence", and
2378 :     <li>an <b>L</b> or a <b>W</b> means "low or weak confidence".
2379 :     </ol>
2380 :     We are not really employing the codes actively. The major source of
2381 :     strength in an assignment is being part of a subsystem or a
2382 :     well-characterized class of proteins. This will clarify substantially
2383 :     as the Project to Annotate 1000 Genomes progresses.
2384 :     <p>
2385 :     Now, back to the example program:
2386 :     <ul>
2387 :     <li>
2388 :     The program is written to allow the
2389 :     user to optionally attach a block of annotation to each of the genes
2390 :     that get reassigned. If the optional argument (the second argument)
2391 :     is prtesent, it is presumed to be the name of a file containing text
2392 :     to be appended to the annotation the program makes to record the
2393 :     change of assignment. The line of code
2394 :     <pre>
2395 :     $text = join("",`cat $annotation`);
2396 :     </pre>
2397 :     just reads the entire file, concatenates the lines, and assigns the
2398 :     result to the variable <i>$text</i>.
2399 :     <li>
2400 :     We default the <i>confidence level</i> to the empty string.
2401 :     <li> We only make asssignments if the current function differs from
2402 :     the one to be assigned. <i>$funcO</i> is used to hold the existing
2403 :     asssignment. Note that
2404 :     <pre>
2405 :     $funcO = $fig->function_of($peg,$user);
2406 :     </pre>
2407 :     uses a form of <i>$fig->function_of</i> in which a second argument is
2408 :     used to indicate a non-master user.
2409 :     <li>
2410 :     The
2411 :     <pre>
2412 :     $fig->assign_function($peg,$user,$func,$conf);
2413 :     </pre>
2414 :     is used to actually make the assignment.
2415 :     <li>We record the assignment by adding an annotation (possibly with
2416 :     the text from the optional file appended) using
2417 :     <pre>
2418 :     $fig->add_annotation($peg,$user,"Set function to\n$func\n$text");
2419 :     </pre>
2420 :     </ul>
2421 :    
2422 :     <h2>BBHs</h2>
2423 :    
2424 :     Now, let us return briefly to the topic of similarities. The notion
2425 :     of <i>bi-directional best hit</i> (<b>BBH</b>) often pops up in
2426 :     discussions relating to similarities and how to project assignments
2427 :     from a gene in one genome to a gene in another. A gene <i>X</i> in
2428 :     genome <i>Gx</i> is a BBH of a gene <i>Y</i> in <i>Gy</i> iff
2429 :     <ol>
2430 :     <li><i>X</i> is similar to <i>Y</i>,
2431 : redwards 1.3 there is no other gene in <i>Gy</i> more similar to <i>X</i>, and
2432 :     there is no other gene in <i>Gx</i> more similar to <i>Y</i>.
2433 : overbeek 1.1 </ol>
2434 :     To get the BBHs for a given PEG, you would use the something like the
2435 :     following:
2436 :     <pre>
2437 :     @bbhs = $fig->bbhs($peg,1.0e-10);
2438 :     </pre>
2439 :     This would set <i>@bbhs</i> to a list of 2-tuples. Each 2-tuple would
2440 :     contain a PEG and a p-score. They would correspond to BBHs which have
2441 :     similarity better than 1.0e-10.
2442 :    
2443 :     <h2>The Use of dsims</h2>
2444 :    
2445 :     We have constructed a means of computing <i>dynamic similarities</i>
2446 :     fairly rapidly. The basic idea involves breaking the nonredundant
2447 :     protein database into many thousands of subsets, where all of the
2448 :     sequences in a subset are quite similar (sequences that do not fall
2449 :     into subsets get placed into a <i>nohits database</i>). Then, one or more
2450 :     representative sequences from each subset are included in a database
2451 :     of <i>exemplars</i>.
2452 :     <p>
2453 :     To get <i>blastp</i> similarities quickly (and somewhat less
2454 :     accurately than just blasting against the nonredundant database), use
2455 :     something like
2456 :     <pre>
2457 :     @sims = $fig->dsims($id,$seq,200,1.0e-10,"raw")
2458 :     </pre>
2459 :     This would return the similarities that can rapidly be detected
2460 :     between the protein sequence with id given by <i>id</i> and sequence
2461 :     given by <i>$seq</i>. It would give you at most 200 similarities
2462 :     with p-scores better than 1.0e-10, and you would get only the
2463 :     <i>raw</i> similarities (<i>all</i> would return the expanded
2464 :     similarities).
2465 :    
2466 :     <h1> PAUSE: I am pausing here; I hope to finish this tutorial within a few months</h1>
2467 :    
2468 :     <h2>Clusters, Pins, and Largest Clusters</h2>
2469 :    
2470 :     <h2>The Indexes</h2>
2471 :    
2472 :     <h2>Administrative Issues</h2>
2473 :    
2474 :     <h2>Atomated Assignment of Function</h2>
2475 :    
2476 :     <h2>Finding Candidates for a Functional Role</h2>
2477 :    
2478 :     <h2>Protein Families</h2>
2479 :    
2480 :     <h2>Compounds and Reactions</h2>
2481 :    
2482 :     <h2>Subsystems</h2>
2483 :    
2484 :     <h2>Links to PEGs</h2>
2485 :    
2486 :     <h2>Miscellaneous Stuff that You Should Know</h2>

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3