[Bio] / FigTutorial / API.html Repository:
ViewVC logotype

Annotation of /FigTutorial/API.html

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1.1 - (view) (download) (as text)

1 : overbeek 1.1 <h1>Accessing SEED Services</h1>
2 :     <h2>Introduction</h2>
3 :     This document is designed to help a programmer understand how to
4 :     access the services provided by the SEED. It is not designed for the
5 :     user who wishes to confine his experience to working with the SEED
6 :     browser over the Web.
7 :     <p>
8 :     Before we launch into a discussion of the details of how to invoke
9 :     specific routines, there are some background issues that you should
10 :     know about. First, let's talk about <b>fig</b>, which is a
11 :     command-line utility that is used for a number of purposes. Its most
12 :     basic purpose is to exercise the API and illustrate what data is
13 :     accessed by each operation. However, you may also find it extremely
14 :     useful as a quick way to invoke services. To try it out, type
15 :     <pre>
16 :     fig
17 :     </pre>
18 :     This should produce a prompt of the form
19 :     <pre>
20 :     ??
21 :     </pre>
22 :     You should type
23 :     <pre>
24 :     ?? h&lt;cr&gt;
25 :     </pre>
26 :     to get a listing of the commands
27 :     supported by the utility.
28 :     At the time this document was written (May, 2004), the output was as follows:
29 :     <pre>
30 :     <b>?? h</b>
31 :     DB Current DB
32 :     abbrev genome_name
33 :     access_sims [expand|func|expandfunc] (for timeing - access 10 distinct sims)
34 :     add_annotation FID User [ prompted for annotation; terminate with "." at start of line ]
35 :     add_chromosomal_clusters File
36 :     add_genome GenomeDir
37 :     add_pch_pins File
38 :     add_peg_link PEG Link (e.g., &lt;a href=http//...&gt;link to somewhere&lt;/a&gt;)
39 :     all_compounds
40 :     all_features GenomeID Type
41 :     all_maps
42 :     all_protein_families
43 :     all_reactions
44 :     all_roles
45 :     all_sets Relation SetName
46 :     assign_function PEG User [conf=X] Function
47 :     assign_functionF User File [no_annotations]
48 :     assignments_made who date G1 G2 ... [ none => all ]
49 :     auto_assign PEG [Seq]
50 :     auto_assignF User FileOfPEGs
51 :     auto_assignG User GenomeID1 GenomeID2 ...
52 :     bbhs PEG Cutoff
53 :     between n1 n2 n3
54 :     blast PEG [against nr]
55 :     blastit PEG seq db maxP
56 :     boundaries_of Loc
57 :     build_tree_of_complete min_for_label
58 :     by_alias alias
59 :     by_fig_id FID1 FID2
60 :     candidates_for_role Genome Cutoff Role
61 :     cas CID
62 :     cas_to_cid cas
63 :     catalyzed_by RID
64 :     catalyzes role
65 :     cgi_url
66 :     clean_tmp
67 :     close_genes FID distance
68 :     comp2react CID
69 :     contig_ln GenomeID Contig
70 :     coupling_and_evidence FID Bound SimCutoff CouplingCutoff
71 :     crude_estimate_of_distance GenomeID1 GenomeID2
72 :     delete_all_peg_links PEG
73 :     delete_genomes G1 G2 G3 ...Gn
74 :     delete_peg_link PEG URL
75 :     displayable_reaction RID
76 :     dna_seq GenomeID Loc
77 :     dsims FastaFile [MaxN MaxPsc Select] *** select defaults to raw
78 :     ec_to_maps EC
79 :     ec_name EC
80 :     expand_ec EC
81 :     epoch_to_readable [gives readable version of current time]
82 :     export_chromosomal_clusters
83 :     export_pch_pins
84 :     export_set Relation SetName File
85 :     exportable_subsystem Subsystem
86 :     external_calls PEG1 PEG2 PEG3...
87 :     extract_seq ContigsFile loc
88 :     all_exchangable_subsystems
89 :     best_bbh_candidates Genome Cutoff Requested Known
90 :     family_function Family
91 :     fast_coupling FID Bound CouplingCutoff
92 :     feature_aliases FID
93 :     feature_annotations FID
94 :     feature_location FID
95 :     file2N File
96 :     ftype FID
97 :     function_of ID [all] [trans]
98 :     genes_in_region GenomeID Contig Beg End
99 :     genome_of PEG
100 :     genomes [complete|Pat]
101 :     genome_counts [complete]
102 :     genome_version GenomeID
103 :     genus_species GenomeID
104 :     get_translation ID
105 :     get_translations File [tab]
106 :     h [command] *** h h for help on how to use help ***
107 :     hypo Function
108 :     ids_in_family Family
109 :     ids_in_set WhichSet Relation SetName
110 :     in_cluster_with PEG
111 :     in_family FID
112 :     in_pch_pin_with FID
113 :     in_sets Which Relation SetName
114 :     is_archaeal GenomeID
115 :     is_bacterial GenomeID
116 :     is_eukaryotic GenomeID
117 :     is_prokaryotic GenomeID
118 :     is_exchangable_subsystem Subsystem
119 :     is_real_feature FID
120 :     largest_clusters FileOfRoles user
121 :     load_all
122 :     map_to_ecs map
123 :     map_name map
124 :     mapped_prot_ids ID
125 :     maps_to_id ID
126 :     max n1 n2 n3 ...
127 :     merged_related_annotations PEG1 PEG2 ... PEGn
128 :     min n1 n2 n3 ...
129 :     names_of_compound
130 :     neighborhood_of_role role
131 :     org_of prot_id
132 :     pegs_of GenomeID
133 :     possibly_truncated FID
134 :     reaction2comp RID
135 :     related_by_func_sim PEG user
136 :     reversible RID
137 :     rnas_of GenomeID
138 :     roles_of_function function
139 :     same_func 'function1' 'function2'
140 :     search_index pattern
141 :     seqs_with_role role who
142 :     seqs_with_roles_in_genomes genomes roles made_by
143 :     sims PEG maxN maxP select
144 :     sort_fids_by_taxonomy FID1 FID2 FID3 ...
145 :     sort_genomes_by_taxonomy GenomeID1 GenomeID2 GenomeID3 ...
146 :     sz_family family
147 :     taxonomic_groups_of_complete min_for_labels
148 :     taxonomy_of GenomeID
149 :     translatable prot_id
150 :     translate_function PEG user
151 :     translated_function_of PEG user
152 :     translation_length ID
153 :     unique_functions PEGs user [make the pegs comma separated]
154 :     verify_dir dir
155 :    
156 :     </pre>
157 :     If you quickly browse through the commands that are supported by <b>fig</b>,
158 :     you should get a fairly complete overview of the services provided by the API.
159 :     We are not going to cover the use of <b>fig</b> in detail in this
160 :     tutorial, but we do encourage you to experiment with it. By the way,
161 :     you should type <b>x</b> to exit from it.
162 :     <p>
163 :     <b>fig</b> is ususally accessed in interactive mode (invoked with no
164 :     arguments). However, if you wish to invoke it to process a single
165 :     command, you can just put the command as a single argument on the
166 :     command-line. Thus,
167 :     <pre>
168 :     fig 'add_peg_link fig|188.1.peg.96 &lt;a href=http://nature.berkeley.edu/~hongwang/Project/ATP_synthase/&gt;animation&lt;/a&gt;'
169 :     </pre>
170 :     would add a link to PEG <i>fig|188.1.peg.96</i> while
171 :     <pre>
172 :     fig 'delete_all_peg_links fig|188.1.peg.96'
173 :     </pre>
174 :     would delete all links from the PEG. Since adding links to PEGs is
175 :     something you might wish to do, the easiest way to do it is to just
176 :     create a file of commands of the sort shown, make the file executable,
177 :     and then run it.
178 :     <p>
179 :     If you are a Perl programmer, eventually you will wish to peruse the
180 :     <b>fig</b> source code to see how it is structured, how to add to it,
181 :     and how to invoke services via the API.
182 :     <p>
183 :     We have briefly discussed the utility <b>fig</b>. Now it is time to
184 :     discuss writing your own programs using the API. Basically, the
185 :     entire API is encapsulated in the Perl module <b>FIG.pm</b>. To illustrate
186 :     the style of programming we recommend, let's create a simple program
187 :     to add a link to a PEG (rather than using <b>fig</b>).
188 :     Adding a link
189 :     to a PEG will cause the link to be displayed whenever anyone examines
190 :     the PEG via the SEED web services..
191 :     The API defines a set of services that do many things; adding a link
192 :     to a PEG is just one very minor service, but understanding exactly how one
193 :     could invoke such a service is what this tutorial is about.
194 :     To begin, use
195 :     <pre>
196 :     tool_hdr add_link
197 :     </pre>
198 :     This creates a file called <i>add_link</i> and makes it executable.
199 :     The file contains an initial set of Perl commands designed to
200 :     establish the correct execution environment (where the libraries are,
201 :     and so forth). You need to add your code to the end of the
202 :     initialized file. We recommend adding the following lines:
203 :     <hr>
204 :     <pre>
205 :     use FIG;
206 :     my $fig = new FIG;
207 :    
208 :     $usage = "usage: add_link PEG Link";
209 :    
210 :     (
211 :     ($peg = shift @ARGV) &&
212 :     ($link = shift @ARGV)
213 :     )
214 :     || die $usage;
215 :    
216 :     $fig->add_peg_links([$peg,$link]);
217 :     </pre>
218 :     <h3>A short program to add a link to a PEG</h3>
219 :     <hr>
220 :     We urge you to study this short program carefully, since it reflects
221 :     the style we will use throughout this tutorial. There are a number of
222 :     things worth noting:
223 :     <ul>
224 :     <li>
225 :     The lines
226 :     <pre>
227 :     use FIG;
228 :     my $fig = new FIG;
229 :     </pre>
230 :     initialize <i>$fig</i>. This is your key to the fig services. All
231 :     API services are invoked using
232 :     <pre>
233 :     $fig->rtn(arguments)
234 :     </pre>
235 :     <li>
236 :     The last line of the program is the actual invocation of the service
237 :     to add a link to a peg. Actually, <i>add_peg_links</i> is a service
238 :     routine that takes any number of arguments. Each argument is a
239 :     2-tuple (i.e., a pointer to a list containing two elements). Each
240 :     2-tuple contains a PEG and a link. Links should be of the form:
241 :     <pre>
242 :     &lt;a href=URL&gt;LABEL&lt;/a&gt;
243 :     </pre>
244 :     </ul>
245 :     We suggest that you type in this program and gget it to run on your
246 :     version of the SEED. Then, use <b>fig</b> to delete any links you
247 :     added.
248 :     <h2>Accessing Data Relating to Genomes</h2>
249 :     Let us now begin a somewhat more systematic exploration of the services offered by the API.
250 :     We will begin with genome-related services, of which there are relatively few in the SEED.
251 :     Essentially, genomes are represented by IDs of the form <i>xxx.y</i>
252 :     where <i>xxx</i> is the NCBI taxon ID, and <i>y</i> is a suffix to
253 :     disambiguate situations in which you have multiple genomes with the
254 :     same taxon ID.
255 :    
256 :     <hr>
257 :     <pre>
258 :     use strict;
259 :     use FIG;
260 :     my $fig = new FIG;
261 :    
262 :     my(%sofar,$genome,$which,$abbrev,$version);
263 :    
264 :     foreach $genome ($fig->genomes("complete"))
265 :     {
266 :     $which = $fig->genus_species($genome);
267 :     $abbrev = &get_unique_abbreviation($which,\%sofar);
268 :     $version = $fig->genome_version($genome);
269 :     print "$genome\t$abbrev\t$version\t$which\n";
270 :     }
271 :    
272 :     sub get_unique_abbreviation {
273 :     my($which,$sofar) = @_;
274 :     my($nonunique,$n);
275 :    
276 :     $nonunique = &FIG::abbrev($which);
277 :     $nonunique =~ s/ //g;
278 :     $nonunique =~ s/\.+$//;
279 :     $n = $sofar->{$nonunique};
280 :     if (! defined($n))
281 :     {
282 :     $n = $sofar->{$nonunique} = 1;
283 :     }
284 :     $sofar->{$nonunique}++;
285 :     return $nonunique . ".$n";
286 :     }
287 :     </pre>
288 :     <h3>An illustration of genome-related services</h3>
289 :     <hr>
290 :     There are a few points worth noting about the short example:
291 :     <ul>
292 :     <li>
293 :     <i>$fig->genomes</i> returns a list of all of the genome IDs. It has
294 :     two arguments that can be omitted. The first is used to request only
295 :     complete (or near complete) genomes (this is used in the example).
296 :     For each genome in the SEED there is a directory in
297 :     <i>$FIG_Config::organisms</i> that encapsulates data about the genome.
298 :     Genomes will be considered "complete" iff there exists a file
299 :     <i>$FIG_Config::organisms/$genome/COMPLETE</i> (where $genome contains the
300 :     genome ID). The second argument can be used to restrict the returned
301 :     list of genomes to those that have restrictions on distribution
302 :     (indicated by the presence of
303 :     <i>$FIG_Config::organisms/$genome/RESTRICTIONS</i>). Thus,
304 :     <i>$fig->genomes("complete","restricted")</i> could be used to get a
305 :     list of complete, but restricted, genomes.
306 :     <li>
307 :     There are a number of routines contained in <i>FIG.pm</i> which are
308 :     not really services, but are there because they are simply offer
309 :     convenient subroutines. They probably should not be part of an API.
310 :     For example <i>&FIG::between($x,$y,$z)</i> can be invoked to return
311 :     "true" iff <i>$y</i> is between <i>$x</i> and <i>$z</i>.
312 :     <i>&FIG::abbrev($which)</i> invokes a simple routine that produces an
313 :     abbrevbiation of a genome description. It will usually contain
314 :     spaces and sometimes trailing periods, so if you want an abbreviation without these, you have to get
315 :     rid of them. Further, the abbreviations are not unique (that is,
316 :     invoking this routine for two distinct genomes can produce the same
317 :     result). Hence, we give a simple little routine that produces unique
318 :     abbreviations (suitable, for example, to use for sequences in an
319 :     alignment).
320 :     <li>
321 :     Finally, it is worth noting that, as of now, we think of a genome
322 :     "version" as composed of a genome ID concatenated with a checksum.
323 :     This may or may not change in the future.
324 :     </ul>
325 :     Here are just a few lines from the output of this little program:
326 :     <pre>
327 :     197221.1 The.elon.BP.1 197221.1.3406724106 Thermosynechococcus elongatus BP-1
328 :     198094.1 Bac.anth.st.2 198094.1.2593352402 Bacillus anthracis str. Ames
329 :     198214.1 Shi.flex.2a.1 198214.1.1464268764 Shigella flexneri 2a str. 301
330 :     198215.1 Shi.flex.2a.2 198215.1.1756034760 Shigella flexneri 2a str. 2457T
331 :     </pre>
332 :     What other genome-related services does the SEED offer? The only ones
333 :     worth mentioning here relate to <i>contigs</i>. Here is a short
334 :     example illustrating how to use those services. Basically, you can
335 :     <ol>
336 :     <li>
337 :     get a list of the contigs in a given genome,
338 :     <li>
339 :     get the length of a specified contig, and
340 :     <li>
341 :     extract sequence from a contig.
342 :     </ol>
343 :     To illustrate these features, here is a short program that list all of
344 :     the contigs for a designated genome. For each contig that is over 50
345 :     characters in size (and, they all should be), it displays the contig
346 :     ID, the contig length, and the first 50 characters of sequence.
347 :    
348 :     <hr>
349 :     <pre>
350 :     use strict;
351 :     use FIG;
352 :     my $fig = new FIG;
353 :    
354 :     my $usage = "usage: contigs_for_genome Genome";
355 :    
356 :     my($genome,@contigs,$contig,$len,$seq);
357 :    
358 :     ($genome = shift @ARGV)
359 :     || die $usage;
360 :    
361 :     @contigs = $fig->all_contigs($genome);
362 :     foreach $contig (@contigs)
363 :     {
364 :     $len = $fig->contig_ln($genome,$contig);
365 :     if ($len >= 50)
366 :     {
367 :     $seq = $fig->dna_seq($genome,$contig . "_1_50");
368 :     print "$contig\t$len\t$seq\n";
369 :     }
370 :     }
371 :    
372 :     </pre>
373 :     <h3>How to access services related to contigs</h3>
374 :     <hr>
375 :     The services being used in this short example are as follows:
376 :     <ul>
377 :     <li>
378 :     <i>$fig->all_contigs($genome)</i> returns a list of the contigs that
379 :     make up the designated genome,
380 :     <li>
381 :     <i>$fig->contig_ln($genome,$contig)</i> returns the length of the
382 :     designated contig, and
383 :     <li>
384 :     <i>$fig->dna_seq($genome,$contig . "_1_50")</i> returns the first 50
385 :     characters of the contig.
386 :     </ul>
387 :     This last service (<i>dna_seq</i>) requires some explanation. You
388 :     need to know that
389 :     <ol>
390 :     <li>
391 :     Locations on a contig begin with 1 (not 0 as some computer scientists
392 :     would prefer).
393 :     <li>
394 :     Within the SEED, locations are represented as a comma-separated list
395 :     of <i>contiguous regions</i>. A contiguous region is represented by a
396 :     contig ID, a beginning
397 :     position, and an ending position all separated by underscores. One
398 :     fact that can be a little confusing is that contig IDs themselves may
399 :     contain underscores. Thus,
400 :     <pre>
401 :     NC_000913_1_50
402 :     </pre>
403 :     would be used to represent the first fifty chracters of the contig
404 :     NC_000913.
405 :     <li>
406 :     Sections of sequence on the complementary strand are represented
407 :     similarly (with the beginning point greater than the ending
408 :     position). Thus,
409 :     <pre>
410 :     NC_000913_50_1
411 :     </pre>
412 :     would represent the sequence on the complementary strand of NC_000913
413 :     going from position 50 to position 1.
414 :     <li>
415 :     The routine <i>dna_seq</i> will accept a location (i.e., a
416 :     comma-separated list of contiguous regions), or even just a list of
417 :     arguments each representing a contiguous region, and return
418 :     the concatenation of the sequences from those locations. Thus,
419 :     <pre>
420 :     $fig->dna_seq($genome,"NC_000913_1_50,NC_000913_52_100")
421 :     or
422 :     $fig->dna_seq($genome,"NC_000913_1_50","NC_000913_52_100")
423 :     </pre>
424 :     would return 99 characters formed by deleting character 51 from the
425 :     first 100 characters of NC_000913.
426 :     </ol>
427 :    
428 :     <h2>Accessing Data Relating to Features</h2>
429 :    
430 :     The SEED was designed, like WIT and ERGO before it, to allow a variety
431 :     of types of features. In fact, the SEED is essentially focused on
432 :     protein-encoding genes, which I perversely named <i>pegs</i> rather
433 :     than <i>CDSs</i>. It is true that WIT contained RNAs, inteins,
434 :     introns, and repeat sequences, and ERGO contained RNAs and IS
435 :     elements; however, the SEED really only has pegs and RNAs at this
436 :     point (and, the RNAs are not adequately handled).
437 :     <p>
438 :     There is a fairly rich set of services for pegs, as you might guess
439 :     from the web services offered by the SEED. As an introduction to some
440 :     of these capabilities, consider the following short program:
441 :    
442 :     <hr>
443 :     <pre>
444 :     use strict;
445 :     my($usage,$id,$fid,$peg,$loc,$contig,$beg,$end,$start_reg,$end_reg);
446 :     my($loc1,$aliases1,$trunc,$pseq,$prot_ln,$func,$features_in_region);
447 :     my($start_of_leftmost_feature,$end_of_rightmost_feature);
448 :     my($genome);
449 :    
450 :     use FIG;
451 :     my $fig = new FIG;
452 :    
453 :     $usage = "usage: features_around ID";
454 :    
455 :     ($id = shift @ARGV)
456 :     || die $usage;
457 :    
458 :     if ($peg = $fig->by_alias($id))
459 :     {
460 :     if ($loc = $fig->feature_location($peg))
461 :     {
462 :     ($contig,$beg,$end) = $fig->boundaries_of($loc);
463 :     if (defined($contig))
464 :     {
465 :     $start_reg = &FIG::min($beg,$end) - 5000;
466 :     $end_reg = &FIG::max($beg,$end) + 5000;
467 :     $genome = &FIG::genome_of($peg);
468 :     ($features_in_region,$start_of_leftmost_feature,$end_of_rightmost_feature) =
469 :     $fig->genes_in_region($genome,$contig,$start_reg,$end_reg);
470 :     foreach $fid (@$features_in_region)
471 :     {
472 :     $loc1 = $fig->feature_location($fid);
473 :     $aliases1 = $fig->feature_aliases($fid);
474 :     $trunc = $fig->possibly_truncated($fid);
475 :    
476 :     $pseq = $func = $prot_ln = "";
477 :    
478 :     if (&FIG::ftype($fid) eq "peg")
479 :     {
480 :     if ($prot_ln = $fig->translation_length($fid))
481 :     {
482 :     $pseq = $fig->get_translation($fid);
483 :     $func = $fig->function_of($fid);
484 :     }
485 :     }
486 :     print join("\t",($fid,$loc1,$aliases1,$trunc,$func,$prot_ln,$pseq)),"\n";
487 :     }
488 :     }
489 :     }
490 :     else
491 :     {
492 :     print STDERR "Sorry, could not get the location of $fid\n";
493 :     }
494 :     }
495 :     else
496 :     {
497 :     print STDERR "Sorry, could not figure out which PEG you meant by $id\n";
498 :     }
499 :     </pre>
500 :     <h3>Displaying features close to a given feature</h3>
501 :     <hr>
502 :     This short program takes as an argument an ID that corresponds to a
503 :     feature. The ID can be a FIG feature ID of the form
504 :     <pre>
505 :     fig|xxxx.y.peg.zzzz
506 :     </pre>
507 :     where <i>xxxx.y</i> is the genome ID and <i>zzzz</i> is a unique
508 :     sequence number (these usually occur in the same order as the genes
509 :     occur on the contigs, but this is not guaranteed to be true). The
510 :     argument could also be any other ID known by the SEED that uniquely
511 :     corresponds to the FIG ID. For example,
512 :     <pre>
513 :     features_around 'sp|Q9TL34'
514 :     </pre>
515 :    
516 :     would display twelve features (<i>fig|31312.1.peg.1</i> through
517 :     <i>fig|31312.1.peg.12</i>, since <i>sp|Q9TL34</i> corresponds to
518 :     <i>fig|31312.1.peg.5</i>. The program outputs a tab-separated file
519 :     of the sort one might use as input to a spreadsheet like Excel. The
520 :     columns in the output contain:
521 :    
522 :     <ol>
523 :     <li>
524 :     the FIG ID,
525 :     <li>
526 :     the location of the feature,
527 :     <li>
528 :     a list of aliases for the gene,
529 :     <li>
530 :     a flag indicating whether or not the gene occurs near the end of the
531 :     contig (and, hence, is potentially truncated),
532 :     <li>
533 :     the function currently assigned to the gene,
534 :     <li>
535 :     the length of the protein sequence encoded by the gene, and
536 :     <li>
537 :     the protein sequence encoded by the gene.
538 :     </ol>
539 :     Of course, some of the features might not be PEGs, and in that case
540 :     the last three fields are just empty.
541 :     <p>
542 :     Now, let's go through some of the details in the code:
543 :     <ul>
544 :     <li>
545 :     The programs begins by trying to locate the FIG ID corresponding to
546 :     the ID given as the command-line argument. The expression
547 :     <pre>
548 :     $fig->by_alias($id)
549 :     </pre>
550 :     returns the FIG ID when it can be determined (otherwise,
551 :     <i>undef</i>).
552 :     <li>
553 :     Once the FIG ID for the PEG has been determined, the location is
554 :     requested using
555 :     <pre>
556 :     $fig->feature_location($peg)
557 :     </pre>
558 :     Remember, a location is a sequence of contiguous regions separated by
559 :     commas (often a single contiguous region); a contiguous region
560 :     amounts to a contig ID, a beginning position, and an ending position
561 :     separated by underscores. In some contexts, all you want is the
562 :     region containing a gene (without the details of intron/exon
563 :     boundaries). In this case, you can use
564 :     <pre>
565 :     ($contig,$beg,$end) = $fig->boundaries_of($loc);
566 :     </pre>
567 :     to convert the location to a single contiguous region. These will
568 :     succeed only if the sequence of contiguous regions all have the same
569 :     contig ID. <i>$beg</i> will be set to the start of the first exon,
570 :     while <i>$end</i> will be set to the end of the last exon.
571 :     <li>
572 :     Once the boundaries of the gene have been determined, the program asks
573 :     for all genes that overlap a region starting 5000 characters (bases,
574 :     if you like) to the left of the gene and ending 5000 characters to the
575 :     right of the gene. This is done by using
576 :     <pre>
577 :     $genome = &FIG::genome_of($peg);
578 :     </pre
579 :     to get the genome of the feature (<i>$fig->genome_of($peg)</i> would
580 :     work, as well), and then using
581 :     <pre>
582 :     ($features_in_region,$start_of_leftmost_feature,$end_of_rightmost_feature) =
583 :     $fig->genes_in_region($genome,$contig,$start_reg,$end_reg);
584 :     </pre>
585 :     to extract the set of genes that overlap the region specified by the
586 :     four arguments <i>$genome, $contig, $start_reg, </i>and
587 :     <i>$end_reg</i>. A list of three things is returned by the
588 :     invocation:
589 :     <ol>
590 :     <li>
591 :     <i>$features_in_region</i> will be set to a pointer to a list of the
592 :     features (represented by FIG IDs) that overlap the specified region.
593 :     <li>
594 :     <i>$start_of_leftmost_feature</i> will be set to the "leftmost"
595 :     position occupied by one of the features that is returned.
596 :    
597 :     <i>$end_of_rightmost_feature</i> will be set to the "rightmost"
598 :     position occupied by one of the features that is returned.
599 :     </ol>
600 :     Thus, if you were writing a graphics module to display the features,
601 :     you would need to show the region from
602 :     <i>$start_of_leftmost_feature</i> to <i>$end_of_rightmost_feature</i>
603 :     if you wished to capture the features without truncation.
604 :     <li>
605 :     Once the features have been acquired, the program has a loop that
606 :     simply extracts data about each feature and prints it.
607 :     <pre>
608 :     $loc1 = $fig->feature_location($fid);
609 :     $aliases1 = $fig->feature_aliases($fid);
610 :     $trunc = $fig->possibly_truncated($fid);
611 :     </pre>
612 :     shows how to get the location of a feature, a comma-separated list of
613 :     aliases, and an indication of whether or not the feature might be
614 :     truncated (this is a rather simple check to just see if the feature
615 :     occurs within a few hundred bases from the end of the contig).
616 :     The use of
617 :     <pre>
618 :     &FIG::ftype($fid)
619 :     </pre>
620 :     just extracts the type of the feature (which is almost always
621 :     <i>peg</i> or <i>rna</i>, although we certainly hope that we will have
622 :     a richer set of feature types in the not too distant future).
623 :     The two invocations
624 :     <pre>
625 :     $prot_ln = $fig->translation_length($fid)
626 :     $pseq = $fig->get_translation($fid)
627 :     </pre>
628 :     show how to get the length of the protein sequence encoded by a PEG
629 :     and the actual sequence itself.
630 :     Finally,
631 :     <pre>
632 :     $func = $fig->function_of($fid)
633 :     </pre>
634 :     is used to get the current master function assignment for the gene.
635 :     The SEED does support non-master assignments, although their use has
636 :     been discouraged. the original intent was to allow inexperienced
637 :     users to make assignments that did not overwrite the "master"
638 :     assignment. Users who identify themselves as <i>master:someone</i>
639 :     will only make assignments that overwrite the master assignment.
640 :     However, a user without the "master:" prefix can make an assignment
641 :     that is retained seprately. To access such an assignment, you would
642 :     use
643 :     <pre>
644 :     $func = $fig->function_of($fid,"someone")
645 :     </pre>
646 :     If <i>someone</i> has not made an assignment to <i>$fig</i>, this will
647 :     return the master assignment.
648 :     </ul>
649 :    
650 :     The SEED comes with a directory containing the example programs we are
651 :     displaying. If you find it hard to follow my comments or feel that I
652 :     have failed left out some detail that you need to understand, please
653 :     <ol>
654 :     <li>
655 :     contact me (Ross@TheFIG.info), and
656 :     <li>
657 :     copy the program and play with it. The most basic form of playing is
658 :     to just do things like inserting
659 :     <pre>
660 :     print &Dumper($some_returned_variable);
661 :     </pre>
662 :     to see precisely what comes back.
663 :     </ol>
664 :    
665 :     <h2>Sequence Data</h2>
666 :    
667 :     You have already been introduced to sequence data. You know how to
668 :     access the DNA from a location in a contig from a designated genome,
669 :     you know you to get locations of features, and you know how to acquire
670 :     the translations for PEGs. In this short section we discuss just a
671 :     few utilities that you might find useful.
672 :     <p>
673 :     The next short example illustrates how to access the reverse
674 :     complement of a DNA sequence (i.e., the opposite strand), how to
675 :     translate DNA sequences, and how to write out sequences in FASTA
676 :     format.
677 :     Let us look at it before thinking about additional complexities like
678 :     how to handle nonstandard translations or properly translating the
679 :     first codon:
680 :    
681 :     <hr>
682 :     <pre>
683 :     use FIG;
684 :     my $fig = new FIG;
685 :    
686 :     $usage = "usage: seq_utils Genome Location";
687 :    
688 :     (
689 :     ($genome = shift @ARGV) &&
690 :     ($loc = shift @ARGV)
691 :     )
692 :     || die $usage;
693 :    
694 :     if ($seq = $fig->dna_seq($genome,$loc))
695 :     {
696 :     $seqR = &FIG::reverse_comp($seq);
697 :    
698 :     $tran = &FIG::translate($seq);
699 :     $tranR = &FIG::translate($seqR);
700 :    
701 :     &FIG::display_id_and_seq("plus_strand",\$seq);
702 :     &FIG::display_id_and_seq("plus_strand_translated",\$tran,\*STDERR);
703 :    
704 :     &FIG::display_id_and_seq("minus_strand",\$seqR);
705 :     &FIG::display_id_and_seq("minus_strand_translated",\$tranR,\*STDERR);
706 :     }
707 :    
708 :     </pre>
709 :     <h3>Using some simple sequence utilities</h3>
710 :     <hr>
711 :     The program takes two command-line arguments: a
712 :     genome ID and a location.
713 :     The program is supposed to write out the DNA from both the plus and
714 :     minus strands from the given location to STDOUT, and it writes the
715 :     translations of both strands to STDERR.
716 :     Thus,
717 :     <pre>
718 :     seq_utils 31312.1 NC_000927_9431_9048 > stdout 2> stderr
719 :     </pre>
720 :     will write the following contents to the file <i>stdout</i>:
721 :     <pre>
722 :     >plus_strand
723 :     ATGTCTACTATTACAAATCAAATTATTGAACAACTAAAATCGATGACGCTTTTGGAAGCG
724 :     GCTGAACTTGTGTCTCAAATTGAAGAGACATTTGGAGTTGATGCTTCTGCACCCGCAGCT
725 :     GGAGCGGTGATGGTCGCTGCCGCTGCGCCGTCTGCGCCAGTTGTCGAAGAGCAAACTGAA
726 :     TTTACTGTCATGATTAATGATGTGCCAAGTGCGAAACGAATTGCTGTTATTAAAGTTGTA
727 :     CGTGCTTTAACAGGATTAGGGTTGAAAGAAGCCAAAGATATGATTGAGTCTGTACCAAAA
728 :     GCAATTAAAGAAGGAATTGGGAAAGACGAAGCTCAACAAATTAAACAACAGCTCGAAGAG
729 :     GCTGGTGCAACTGTCACAGTGAAA
730 :     >minus_strand
731 :     TTTCACTGTGACAGTTGCACCAGCCTCTTCGAGCTGTTGTTTAATTTGTTGAGCTTCGTC
732 :     TTTCCCAATTCCTTCTTTAATTGCTTTTGGTACAGACTCAATCATATCTTTGGCTTCTTT
733 :     CAACCCTAATCCTGTTAAAGCACGTACAACTTTAATAACAGCAATTCGTTTCGCACTTGG
734 :     CACATCATTAATCATGACAGTAAATTCAGTTTGCTCTTCGACAACTGGCGCAGACGGCGC
735 :     AGCGGCAGCGACCATCACCGCTCCAGCTGCGGGTGCAGAAGCATCAACTCCAAATGTCTC
736 :     TTCAATTTGAGACACAAGTTCAGCCGCTTCCAAAAGCGTCATCGATTTTAGTTGTTCAAT
737 :     AATTTGATTTGTAATAGTAGACAT
738 :     </pre>
739 :     and the following two sequences to the file <i>stderr</i>
740 :     <pre>
741 :     >plus_strand_translated
742 :     MSTITNQIIEQLKSMTLLEAAELVSQIEETFGVDASAPAAGAVMVAAAAPSAPVVEEQTE
743 :     FTVMINDVPSAKRIAVIKVVRALTGLGLKEAKDMIESVPKAIKEGIGKDEAQQIKQQLEE
744 :     AGATVTVK
745 :     >minus_strand_translated
746 :     FHCDSCTSLFELLFNLLSFVFPNSFFNCFWYRLNHIFGFFQP*SC*STYNFNNSNSFRTW
747 :     HIINHDSKFSLLFDNWRRRRSGSDHHRSSCGCRSINSKCLFNLRHKFSRFQKRHRF*LFN
748 :     NLICNSRH
749 :     </pre>
750 :     Now let us look at the routines within the program that produced these
751 :     results:
752 :     <ul>
753 :     <li>
754 :     You have already seen the use of something like
755 :     <pre>
756 :     $seq = $fig->dna_seq($genome,$loc)
757 :     </pre>
758 :     to extract the DNA. The actual location can specify sequence from
759 :     either strand (remember how?).
760 :     <li>
761 :     The we use
762 :     <pre>
763 :     $seqR = &FIG::reverse_comp($seq)
764 :     </pre>
765 :     to get the reverse complement of the sequence held in <i>$seq</i>.
766 :     <li>
767 :     The program uses
768 :     <pre>
769 :     $tran = &FIG::translate($seq);
770 :     $tranR = &FIG::translate($seqR);
771 :     </pre>
772 :     to get translations of both strands. Note that these invocations use
773 :     the standard genetic code to do the translation. The <i>translate</i>
774 :     service supports two optional arguments. The second argument is used
775 :     to pass a pointer to a hash containing the desired translation, if you wish to use
776 :     a nonstandard code. The keys of the hash should be
777 :    
778 :     <pre>
779 :     AAA =>
780 :     AAC =>
781 :     AAG =>
782 :     .
783 :     .
784 :     .
785 :     TTT =>
786 :     </pre>
787 :     The values are what the codons are translated to (normally, from the
788 :     set {A,C,G,T}, but it is essentially unrestricted). If the third
789 :     argument evaluates to <i>true</i> (i.e., defined, not null string, and
790 :     not 0), then initial GTG and TTG codons will be translated to "M".
791 :     <li>
792 :     The final little utility we employ is used to write the FASTA files:
793 :     <pre>
794 :     &FIG::display_id_and_seq("plus_strand",\$seq);
795 :     &FIG::display_id_and_seq("plus_strand_translated",\$tran,\*STDERR);
796 :    
797 :     &FIG::display_id_and_seq("minus_strand",\$seqR);
798 :     &FIG::display_id_and_seq("minus_strand_translated",\$tranR,\*STDERR);
799 :     </pre>
800 :     The first argument is the <i>id</i> you wish, the second is a pointer
801 :     to the sequence to write out, and the third argument (which is
802 :     optional) is a file handle. It defaults to writing to STDOUT.
803 :    
804 :     <h2>Accessing Similarities</h2>
805 :    
806 :     A great deal of comparative analysis relates to use of detected
807 :     similarities. The SEED includes precomputed similarities for each
808 :     protein sequence. Basically these similarities are computed as
809 :     follows:
810 :     <ol>
811 :     <li>
812 :     First, a nonredundant protein sequence data base is formed by taking
813 :     all protein sequences from a number of sources, including the
814 :     organisms loaded into the SEED (but, including sequences from external
815 :     sources that are not found in the SEED organisms). Sets of protein
816 :     sequences that differ only in start calls are collapsed into single
817 :     entries in the nonredundant database (the longer sequences are
818 :     retained, and a table of synonyms is constructed showing which IDs
819 :     correspond to each nonredundant entry, and which of the merged IDs
820 :     actually represent shorter sequences.
821 :     <li>
822 :     Then all of the sequences in the nonredundant database are blasted
823 :     against one another. For each sequence, there may be only a few or a
824 :     great many similarities. We keep only the best similarity between any
825 :     two sequences. Further, we set a maximum number of similarities
826 :     retained for each sequence to "about 300". We try to keep a set that
827 :     includes at least the best similarity between the given sequence and
828 :     sequences from any specific organism. We do not guarantee that we even
829 :     succeed in that. So, you should think of the precomputed similarities
830 :     as a potentially useful, but quite truncated, set. We are moving
831 :     towards a situation in which relationships between closely related sequences are saved in
832 :     the form of trees (and the corresponding similarities are discarded).
833 :     <li>
834 :     When a user requests the similarities to a given PEG, first the SEED
835 :     must figure out which entry in the nonredundant protein sequence
836 :     collection corresponds to the PEG (it might actually be the
837 :     translation corresponding to the PEG, or it might be another sequence
838 :     ID corresponding to a longer sequence from, say, Swiss Prot). Then,
839 :     the block of similarities corresponding to the representative ID is
840 :     extracted. Each of the similarities in this set really represents a
841 :     collection of similarities (one corresponding to each ID in the set of
842 :     sequences that were collapsed into a single representative ID). For
843 :     each of the "raw" similarities, you can choose to expand it into the
844 :     entire set it represents or not, as you wish. The expansion actually
845 :     takes much more time than actually extracting the block of
846 :     similarities, which is an irritating fact (which may be correctable by
847 :     some bright young computer scientists, but for now...). So, it is
848 :     important that you understand the two concepts:
849 :     <ul>
850 :     <li>
851 :     A <i>raw similarity</i> is a similarity between two members of the
852 :     nonredundant protein collection.
853 :     <li>
854 :     The act of <i>expanding a similarity</i> amounts to taking a
855 :     similarity between the given ID (ususally called <i>id1</i>) and
856 :     another (usually called <i>id2</i>) and expanding it by considering
857 :     all ids that were collapsed to <i>id2</i>.
858 :     </ul>
859 :     <li>
860 :     Finally, you need to know that when you request similarities against a
861 :     given id, you get to specify how many of the similarities you wish
862 :     expanded. This is useful, since one might wish to see the expanded
863 :     set for, say, the closest five, while getting several hundred
864 :     unexpanded similarities.
865 :     </ol>
866 :     <br>
867 :     Here is a short program that illustrates some of these comments. It
868 :     takes as input arguments
869 :     <ol>
870 :     <li>
871 :     a given PEG id,
872 :     <li>
873 :     a similarity threshhold (minimum p-score),
874 :     <li>
875 :     the maximum number of similarities to be returned, and
876 :     <li>
877 :     the number of similarities to expand.
878 :     </ol>
879 :     For each similarity, it displays the
880 :     <ul>
881 :     <li> the length of the given PEG,
882 :     <li> id2 (the id of the similar sequence),
883 :     <li> the length of id2,
884 :     <li> the p-score,
885 :     <li> the perecent identity,
886 :     <li> the region of similarity, and
887 :     <li> the function assigned to the similar sequence.
888 :     </ul>
889 :    
890 :     <hr>
891 :     <pre>
892 :     use FIG;
893 :     my $fig = new FIG;
894 :     use Sim;
895 :    
896 :     use strict;
897 :    
898 :     my($usage,$peg,$cutoff,$max,$expand);
899 :     my(@sims,$sim,$id2,$pscore,$iden,$b1,$e1,$b2,$e2,$ln1,$ln2,$func1,$func2);
900 :    
901 :     $usage = "usage: show_similarities PEG CutOff MaxReturned Expand";
902 :    
903 :     (
904 :     ($peg = shift @ARGV) &&
905 :     ($cutoff = shift @ARGV) &&
906 :     ($max = shift @ARGV) &&
907 :     defined($expand = shift @ARGV)
908 :     )
909 :     || die $usage;
910 :    
911 :     @sims = $fig->sims($peg,$max,$cutoff,"all",$expand);
912 :     foreach $sim (@sims)
913 :     {
914 :     $ln1 = $sim->ln1;
915 :     $id2 = $sim->id2;
916 :     $ln2 = $sim->ln2;
917 :     $pscore = $sim->psc;
918 :     $iden = $sim->iden;
919 :     $b1 = $sim->b1;
920 :     $e1 = $sim->e1;
921 :     $b2 = $sim->b2;
922 :     $e2 = $sim->e2;
923 :     $func2 = $fig->function_of($id2);
924 :     print join("\t",($ln1,$id2,$ln2,$pscore,$iden,$b1,$e1,$b2,$e2,$func2)),"\n";
925 :     }
926 :     </pre>
927 :    
928 :     <h3>How to Access Similarities</h3>
929 :     <hr>
930 :     If youn were to run this program using
931 :     <pre>
932 :     show_similarities 'fig|83333.1.peg.3' 1.0e-5 500 0
933 :     </pre>
934 :     it would produce an output file that started with the following lines:
935 :     <pre>
936 :     310 fig|216592.1.peg.4423 310 4e-178 99.68 1 310 1 310 Homoserine kinase (EC 2.7.1.39)
937 :     310 sp|Q8XA82 310 9e-178 99.35 1 310 1 310 Homoserine kinase (EC 2.7.1.39)
938 :     310 pirnr|NF01089494 310 2e-177 99.35 1 310 1 310 Homoserine kinase
939 :     310 pirnr|NF01137671 310 2e-177 99.35 1 310 1 310 Homoserine kinase (EC 2.7.1.39) (HK)
940 :     310 fig|216599.1.peg.3713 310 3e-177 99.35 1 310 1 310 Homoserine kinase (EC 2.7.1.39)
941 :     </pre>
942 :     This output amounts to the first five unexpanded similarities. If you
943 :     were to run
944 :     <pre>
945 :     show_similarities 'fig|83333.1.peg.3' 1.0e-5 500 2
946 :     </pre>
947 :     it would produce an output file that started with the following lines:
948 :     <pre>
949 :     310 sp|P00547 310 0 100 1 310 1 310 Homoserine kinase (EC 2.7.1.39)
950 :     310 pirnr|NF00702903 310 0 100 1 310 1 310 Homoserine kinase (EC 2.7.1.39) (HK)
951 :     310 kegg|eco:b0003 310 0 100 1 310 1 310 homoserine kinase (HK) [EC:2.7.1.39]
952 :     310 kegg|ecj:JW0002 310 0 100 1 310 1 310 Homoserine kinase [EC:2.7.1.39]
953 :     310 pirnr|NF01368612 310 0 100 1 310 -18 291 homoserine kinase
954 :     310 gi|33383907 310 0 100 1 310 -18 291 homoserine kinase
955 :     310 gi|16127997 310 0 100 1 310 1 310 homoserine kinase
956 :     310 fig|216592.1.peg.4423 310 4e-178 99.68 1 310 1 310 Homoserine kinase (EC 2.7.1.39)
957 :     310 fig|216593.1.peg.3578 310 4e-178 99.68 1 310 1 310 Homoserine kinase (EC 2.7.1.39)
958 :     310 fig|216598.1.peg.2077 310 4e-178 99.68 1 310 1 310 Homoserine kinase (EC 2.7.1.39)
959 :     310 gi|10186209 310 4e-178 99.68 1 310 -21 288 homoserine kinase
960 :     310 gi|33383857 310 4e-178 99.68 1 310 -18 291 homoserine kinase
961 :     310 pirnr|NF00706256 310 4e-178 99.68 1 310 -21 288 Homoserine kinase (Fragment)
962 :     310 pirnr|NF01368414 310 4e-178 99.68 1 310 -18 291 homoserine kinase
963 :     310 tr|Q9ETA5 310 4e-178 99.68 1 310 -21 288 Homoserine kinase
964 :     310 sp|Q8XA82 310 9e-178 99.35 1 310 1 310 Homoserine kinase (EC 2.7.1.39)
965 :     310 fig|155864.1.peg.3 310 9e-178 99.35 1 310 1 310 Homoserine kinase (EC 2.7.1.39)
966 :     310 kegg|ece:Z0003 310 9e-178 99.35 1 310 1 310 homoserine kinase [EC:2.7.1.39]
967 :     </pre>
968 :     The first seven lines of this expanded set of similarities amount to
969 :     identical matches against the set of sequences that were collapsed
970 :     into a single entry in the nonredundant database (i.e., sequences that
971 :     were essentially identical with <i>fig|83333.1.peg.3</i>). The next
972 :     eight similarities correspond to the first "raw" similarity, and so
973 :     forth.
974 :     <br>
975 :     Now let us examine the program:
976 :     <ul>
977 :     <li>Use the <i>Sim.pm</i> module, if you are processing similarities.
978 :     That is, include
979 :     <pre>
980 :     use Sim;
981 :     </pre>
982 :     at the head of your program.
983 :    
984 :     <li>
985 :     The invocation
986 :     <pre>
987 :     $fig->sims($peg,$max,$cutoff,"all",$expand)
988 :     </pre>
989 :     takes several parameters (the given PEG, the maximum number of
990 :     similarities to return, the p-score cutoff, a "selection" option, and
991 :     the number of raw similarities to expand).
992 :     Itb returns a list of objects of type <i>similarity</i>.
993 :     The "selection" option can
994 :     be
995 :     <ul>
996 :     <li><i>raw</i> to get just raw similarities (equivalent to expanding
997 :     0)
998 :     <li><i>all</i> returns all similarities,
999 :     <li><i>fig</i> returns only similarities against FIG ids,
1000 :     <li><i>external</i> returns only similarities against ids that are not
1001 :     FIG ids,
1002 :     </ul>
1003 :     To understand thye impact of the parameters, I suggest that you take
1004 :     the simple program and run it with altered parameters, or just use the
1005 :     SEED web services to see what happens.
1006 :    
1007 :    
1008 :     <li>
1009 :     To extract values relating to the similarity, you would use
1010 :     assignments as shown:
1011 :     <pre>
1012 :     $ln1 = $sim->ln1;
1013 :     $id2 = $sim->id2;
1014 :     $ln2 = $sim->ln2;
1015 :     $pscore = $sim->psc;
1016 :     $iden = $sim->iden;
1017 :     $b1 = $sim->b1;
1018 :     $e1 = $sim->e1;
1019 :     $b2 = $sim->b2;
1020 :     $e2 = $sim->e2;
1021 :     </pre>
1022 :     </ul>
1023 :    
1024 :     There is a great deal more that could be said about similarities, but
1025 :     for the purposes of this tutorial, that is it.
1026 :    
1027 :     <h2>Functional Coupling</h2>
1028 :    
1029 :     When I say that two genes are "functionally coupled", I mean that they
1030 :     play closely related roles. They may be multiple subunits of the same
1031 :     complex, they may be enzymes from the same pathway, they may be
1032 :     co-regulated, or whatever. Functional coupling can be inferred using
1033 :     many different technologies:
1034 :     <ul>
1035 :     <li>detection of preserved contiguity on chromosomes,
1036 :     <li>analysis of gene fusion event,
1037 :     <li>detection of similar occurrence profiles,
1038 :     <li>creation of protein-protein interaction data,
1039 :     <li>detection of common transcription regulatory sites, and
1040 :     <li>detection of common expression profiles
1041 :     </ul>
1042 :     While the SEED will eventually support all of these techniques (and,
1043 :     now does support the first three), support for detection of preserved
1044 :     contiguity is the only one that is built into the kernel.
1045 :     Hence, this discussion will focus on detection of functional coupling
1046 :     via identification of preserved contiguity of genes.
1047 :     <p>
1048 :     The SEED supports one central role that can be used to summarize
1049 :     preserved contiguity:
1050 :     The following short program illustrates its use:
1051 :    
1052 :     <hr>
1053 :     <pre>
1054 :     use FIG;
1055 :     my $fig = new FIG;
1056 :     use strict;
1057 :    
1058 :     my($usage,$bound,$sim_cutoff,$fc_cutoff,@fc,$tuple,$peg);
1059 :    
1060 :     $usage = "usage: fc Bound SimCutoff FcCutOff < PEGs > PEG1-Sc-PEGs";
1061 :    
1062 :     (
1063 :     ($bound = shift @ARGV) &&
1064 :     ($sim_cutoff = shift @ARGV) &&
1065 :     ($fc_cutoff = shift @ARGV)
1066 :     )
1067 :     || die $usage;
1068 :    
1069 :     while (defined($_ = <STDIN>))
1070 :     {
1071 :     if ($_ =~ /^(fig\|\d+\.\d+\.peg\.\d+)/)
1072 :     {
1073 :     $peg = $1;
1074 :     @fc = $fig->coupling_and_evidence($peg,$bound,$sim_cutoff,$fc_cutoff);
1075 :     foreach $tuple (@fc)
1076 :     {
1077 :     print &Dumper($tuple);
1078 :     }
1079 :     }
1080 :     }
1081 :     </pre>
1082 :    
1083 :     <h3>How to compute functional coupling scores based on preserved contiguity</h3>
1084 :     <hr>
1085 :     The short program simply reads in a file containing PEG ids. For each
1086 :     PEG, it invokes
1087 :     <pre>
1088 :     @fc =$fig->coupling_and_evidence($peg,$bound,$sim_cutoff,$fc_cutoff);
1089 :     </pre>
1090 :    
1091 :     to compute a list of 3-tuples. Each 3-tuple corresponds to a
1092 :     functionally coupled gene. The 3-tuple contains
1093 :     <ol>
1094 :     <li> the number of occurrences of preservation of homologous pairs
1095 :     that were detected in somewhat diverged genomes,
1096 :     <li> the PEG id of a functionally coupled gene, and
1097 :     <li> a pointer to a list of 2-tuples that describe the pairs in
1098 :     different genomes.
1099 :     </ol>
1100 :     If the program is run using the following command
1101 :     <pre>
1102 :     show_functional_coupling 5000 1.0e-10 2 < tmp
1103 :     </pre>
1104 :     and if <i>tmp</i> contains a single line of the form
1105 :     <pre>
1106 :     fig|83333.1.peg.3
1107 :     </pre>
1108 :     then the program will produce the following output:
1109 :     <pre>
1110 :     $VAR1 = [
1111 :     77,
1112 :     'fig|83333.1.peg.2',
1113 :     [
1114 :     [
1115 :     'fig|263.1.peg.1622',
1116 :     'fig|263.1.peg.1624'
1117 :     ],
1118 :     [
1119 :     'fig|263.1.peg.1622',
1120 :     'fig|263.1.peg.1625'
1121 :     ],
1122 :     [
1123 :     'fig|562.2.peg.2720',
1124 :     'fig|562.2.peg.2721'
1125 :     ],
1126 :     [
1127 :     'fig|573.2.peg.4598',
1128 :     'fig|573.2.peg.4594'
1129 :     ],
1130 :     .
1131 :     . <<< deleted lines showing more pairs >>>
1132 :     .
1133 :     ]
1134 :     ];
1135 :     $VAR1 = [
1136 :     44,
1137 :     'fig|83333.1.peg.4',
1138 :     [
1139 :     [
1140 :     'fig|263.1.peg.1622',
1141 :     'fig|263.1.peg.1620'
1142 :     ],
1143 :     [
1144 :     'fig|562.2.peg.2526',
1145 :     'fig|562.2.peg.2527'
1146 :     ],
1147 :     [
1148 :     'fig|562.2.peg.2521',
1149 :     'fig|562.2.peg.2522'
1150 :     ],
1151 :     [
1152 :     'fig|562.2.peg.2525',
1153 :     'fig|562.2.peg.2527'
1154 :     ],
1155 :     [
1156 :     'fig|573.2.peg.4598',
1157 :     'fig|573.2.peg.4599'
1158 :     ],
1159 :     .
1160 :     . <<< deleted lines showing more pairs >>>
1161 :     .
1162 :     ]
1163 :     ];
1164 :     $VAR1 = [
1165 :     10,
1166 :     'fig|83333.1.peg.6',
1167 :     [
1168 :     [
1169 :     'fig|562.2.peg.2526',
1170 :     'fig|562.2.peg.2529'
1171 :     ],
1172 :     [
1173 :     'fig|562.2.peg.2525',
1174 :     'fig|562.2.peg.2529'
1175 :     ],
1176 :     .
1177 :     . <<< deleted lines showing more pairs >>>
1178 :     .
1179 :     ]
1180 :     ];
1181 :     $VAR1 = [
1182 :     7,
1183 :     'fig|83333.1.peg.7',
1184 :     [
1185 :     [
1186 :     'fig|562.2.peg.2526',
1187 :     'fig|562.2.peg.2531'
1188 :     ],
1189 :     [
1190 :     'fig|562.2.peg.2526',
1191 :     'fig|562.2.peg.2530'
1192 :     ],
1193 :     [
1194 :     'fig|562.2.peg.2525',
1195 :     'fig|562.2.peg.2531'
1196 :     ],
1197 :     [
1198 :     'fig|562.2.peg.2525',
1199 :     'fig|562.2.peg.2530'
1200 :     ],
1201 :     [
1202 :     'fig|573.2.peg.4598',
1203 :     'fig|573.2.peg.4604'
1204 :     ],
1205 :     .
1206 :     . <<< deleted lines showing more pairs >>>
1207 :     .
1208 :     ]
1209 :     ];
1210 :     $VAR1 = [
1211 :     7,
1212 :     'fig|83333.1.peg.8',
1213 :     [
1214 :     [
1215 :     'fig|594.1.peg.4954',
1216 :     'fig|594.1.peg.4958'
1217 :     ],
1218 :     [
1219 :     'fig|630.2.peg.581',
1220 :     'fig|630.2.peg.584'
1221 :     ],
1222 :     [
1223 :     'fig|12149.1.peg.2',
1224 :     'fig|12149.1.peg.6'
1225 :     ],
1226 :     .
1227 :     . <<< deleted lines showing more pairs >>>
1228 :     .
1229 :     ]
1230 :     ];
1231 :     </pre>
1232 :     If you simply replaced
1233 :     <pre>
1234 :     print &Dumper($tuple);
1235 :     </pre>
1236 :     with
1237 :     <pre>
1238 :     print "$peg\t$tuple->[0]\t$tuple->[1]\n";
1239 :     </pre>
1240 :     you would get
1241 :     <pre>
1242 :     fig|83333.1.peg.3 77 fig|83333.1.peg.2
1243 :     fig|83333.1.peg.3 44 fig|83333.1.peg.4
1244 :     fig|83333.1.peg.3 10 fig|83333.1.peg.6
1245 :     fig|83333.1.peg.3 7 fig|83333.1.peg.7
1246 :     fig|83333.1.peg.3 7 fig|83333.1.peg.8
1247 :    
1248 :     </pre>
1249 :     which might be closer to what most end users wish to see.
1250 :     <p>
1251 :     The only detail that might be worth noting is the scoring function.
1252 :     It counts the number of pairs (besides the pair composed of the given
1253 :     PEG and the related PEG) that could be located, where "essentially
1254 :     identical" occurrences are collapsed to a single occurrence. For
1255 :     example, if you had 10 <i>E.coli</i> genomes (of closely related
1256 :     strains), you would wish to count them as a single occurrence.
1257 :     Basically, we collapse occurrences between genes that are encode
1258 :     protein sequences that are more than 97% identical to a single
1259 :     occurrence. So, when you see
1260 :     <pre>
1261 :     fig|83333.1.peg.3 77 fig|83333.1.peg.2
1262 :     </pre>
1263 :     This means that there are genes encoding proteins similar to those
1264 :     encoded by fig|83333.1.peg.3 and fig|83333.1.peg.2 that are within
1265 :     5000 bases of one another in 77 quite distinct genomes.
1266 :    
1267 :     <h2>Maps, Subsystems, Functional Roles, and gene Functions</h2>
1268 :    
1269 :     The SEED currently supports four distinct notions that I attempt to
1270 :     differentiate in ths section. My working definitions would be as
1271 :     follows:
1272 :     <ol>
1273 :     <li>
1274 :     A <b>map</b> is a relatively large set of related functional roles.
1275 :     The notion grew out of working with KEGG maps. It encompasses larger
1276 :     functional units than the related notion <b>pathway</b>.
1277 :     <li>The notion of <b>subsystem</b> is a slight extension of the
1278 :     concept <b>pathway</b>. It is a set of functional roles that together
1279 :     play a related function. A pathway is a metabolic subsystem. A
1280 :     ribosome is a nonmetabolic subsystem.
1281 :     <p>
1282 :     Since KEGG is doing such a useful job in attempting to curate maps, we
1283 :     download their maps and use them. We supplement them occasionally
1284 :     with some drawn by ourselves, but we do think that the community owes
1285 :     KEGG real gratitude for making these maps available.
1286 :    
1287 :     <li>My notion of <b>functional role</b> is roughly an abstraction of
1288 :     the function of a gene (i.e., a monfunctional gene). It is a
1289 :     well-defined component of a subsystem. Variants of subsystems should
1290 :     be thought of as sets of functional roles (i.e., not as sets of
1291 :     genes).
1292 :    
1293 :     <li>A <b>gene function</b> may be thought of as a set of functional
1294 :     roles. Often there is just one, but many genes are multifunctional.
1295 :     Somestimes this means that a single domain of the protein encoded by
1296 :     the gene has low specificity which allows the gene to play multiple
1297 :     roles, and sometimes the gene has several domains with each domain
1298 :     implementing a complete functional role. Either way, if a gene
1299 :     function implements multiple roles, it should have the form
1300 :     <pre>
1301 :     role1 / role2 / role3...
1302 :     </pre>
1303 :     That is, the gene function should be made up of a list of roles
1304 :     separated by " / ".
1305 :     A gene function is said to <i>connect</i> to each of the roles that
1306 :     make it up (those roles that are separated by " / "), as well as any
1307 :     embedded EC numbers. Thus, the gene function
1308 :     <pre>
1309 :     Pyruvate kinase (EC 2.7.1.40) / 6-phosphofructokinase (EC 2.7.1.11)
1310 :     </pre>
1311 :     would connect to
1312 :     <ol>
1313 :     <li>Pyruvate kinase (EC 2.7.1.40),
1314 :     <li>6-phosphofructokinase (EC 2.7.1.11),
1315 :     <li>2.7.1.40, and
1316 :     <li>2.7.1.11
1317 :     </ol>
1318 :     </ol>
1319 :     <p>
1320 :     Normally, when we speak of maps, most of the functional roles are
1321 :     simply EC numbers (although some are not). When we speak of
1322 :     subsystems, the functional roles are almost never specified as just an
1323 :     EC number.
1324 :     <p>
1325 :     Our first simple example program takes as command line arguments a map
1326 :     ID (i.e., one of the KEGG map numbers -- these can be taken off the
1327 :     initial page displayed by the SEED web services) and a genome ID. It
1328 :     displays the map ID and the "map name". Then, it lists the roles
1329 :     contained within the map, and for each role it displays the PEGs
1330 :     associated with the role. Thus,
1331 :    
1332 :     <pre>
1333 :     connect_map_to_genes MAP00010 83333.1
1334 :     </pre>
1335 :     produced the following output for me:
1336 :     <pre>
1337 :     MAP00010 Glycolysis / Gluconeogenesis
1338 :     2.7.1.11
1339 :     fig|83333.1.peg.1706 6-phosphofructokinase II (EC 2.7.1.11)
1340 :     fig|83333.1.peg.3835 6-phosphofructokinase (EC 2.7.1.11)
1341 :     2.7.1.41
1342 :     1.2.1.3
1343 :     fig|83333.1.peg.3522 Aldehyde dehydrogenase (EC 1.2.1.3)
1344 :     5.4.2.1
1345 :     fig|83333.1.peg.3548 2,3-bisphosphoglycerate-independent phosphoglycerate mutase (EC 5.4.2.1)
1346 :     fig|83333.1.peg.4303 Phosphoglycerate mutase (EC 5.4.2.1)
1347 :     fig|83333.1.peg.741 Phosphoglycerate mutase (EC 5.4.2.1)
1348 :     3.1.3.10
1349 :     fig|83333.1.peg.989 Glucose-1-phosphatase precursor (EC 3.1.3.10)
1350 :     1.1.1.27
1351 :     fig|83333.1.peg.787 L-lactate dehydrogenase (EC 1.1.1.27)
1352 :     1.8.1.4
1353 :     fig|83333.1.peg.116 Dihydrolipoamide dehydrogenase (EC 1.8.1.4)
1354 :     3.6.1.7
1355 :     fig|83333.1.peg.953 Acylphosphatase (EC 3.6.1.7)
1356 :     2.7.1.40
1357 :     fig|83333.1.peg.1660 Pyruvate kinase (EC 2.7.1.40)
1358 :     fig|83333.1.peg.1838 Pyruvate kinase (EC 2.7.1.40)
1359 :     2.7.1.1
1360 :     3.1.3.9
1361 :     4.1.1.1
1362 :     5.1.3.15
1363 :     1.1.99.8
1364 :     1.2.4.1
1365 :     fig|83333.1.peg.114 Pyruvate dehydrogenase E1 component (EC 1.2.4.1)
1366 :     5.1.3.3
1367 :     fig|83333.1.peg.742 Aldose 1-epimerase (EC 5.1.3.3)
1368 :     6.2.1.1
1369 :     fig|83333.1.peg.3979 Acetyl-coenzyme A synthetase (EC 6.2.1.1)
1370 :     5.4.2.4
1371 :     2.7.1.69
1372 :     fig|83333.1.peg.1086 PTS system, glucose-specific IIBC component (EIIBC-Glc) (Glucose- permease IIBC component) (Phosphotransferase enzyme II, BC component) (EC 2.7.1.69)
1373 :     fig|83333.1.peg.1607 PTS system, maltose and glucose-specific IIABC component (Maltose and glucose-permease IIABC component) (Phosphotransferase enzyme II, ABC component) (EC 2.7.1.69)
1374 :     fig|83333.1.peg.1719 PTS system, cellobiose-specific IIA component (EIIA-Cel) (Cellobiose- permease IIA component) (Phosphotransferase enzyme II, A component) (EC 2.7.1.69)
1375 :     fig|83333.1.peg.1721 PTS system, cellobiose-specific IIB component (EIIB-Cel) (Cellobiose- permease IIB component) (Phosphotransferase enzyme II, B component) (EC 2.7.1.69)
1376 :     fig|83333.1.peg.1800 PTS system, mannose-specific IIAB component (EC 2.7.1.69)
1377 :     fig|83333.1.peg.2068 PTS system, galactitol-specific IIB component (EIIB-GAT) (Galacticol- permease IIB component) (Phosphotransferase enzyme II, B component) (EC 2.7.1.69)
1378 :     fig|83333.1.peg.2069 PTS system, galactitol-specific IIA component (EIIA-GAT) (Galacticol- permease IIA component) (Phosphotransferase enzyme II, A component) (EC 2.7.1.69)
1379 :     fig|83333.1.peg.2142 PTS system, fructose-specific IIBC component (EIIBC-Fru) (Fructose- permease IIBC component) (Phosphotransferase enzyme II, BC component) (EC 2.7.1.69)
1380 :     fig|83333.1.peg.2144 PTS system, fructose-specific IIA/FPr component (EIIA-Fru) (Fructose- permease IIA/FPr component) (Phosphotransferase enzyme II, A/FPr component) (EC 2.7.1.69)
1381 :     fig|83333.1.peg.2165 PTS system, mannitol-specific IIBC component (EC 2.7.1.69)
1382 :     fig|83333.1.peg.2360 PTS system, fructose-specific IIBC component (EC 2.7.1.69)
1383 :     fig|83333.1.peg.2385 PTS system, glucose-specific IIA component (EC 2.7.1.69)
1384 :     fig|83333.1.peg.2657 Sorbitol PTS, EIIC (EC 2.7.1.69)
1385 :     fig|83333.1.peg.2658 PTS system, glucitol/sorbitol-specific IIBC component (EC 2.7.1.69)
1386 :     fig|83333.1.peg.2659 PTS system, glucitol/sorbitol-specific IIA component (EIIA-gut) (Glucitol/sorbitol-permease IIA component) (Phosphotransferase enzyme II, A component) (EC 2.7.1.69)
1387 :     fig|83333.1.peg.2670 PTS system, arbutin-, cellobiose-, and salicin-specific IIABC component (EIIABC-ASC) (Arbutin-, cellobiose-, and salicin-permease IIABC component) (Phosphotransferase enzyme II, ABC component) (EC 2.7.1.69)
1388 :     fig|83333.1.peg.2884 PTS system, mannitol (Cryptic) -specific IIBC component (EIIBC-(C)MTL) (Mannitol (Cryptic)-permease IIBC component) (Phosphotransferase enzyme II, BC component) (EC 2.7.1.69)
1389 :     fig|83333.1.peg.2885 PTS system, mannitol (Cryptic) -specific IIA component (EIIA-(C)MTL) (Mannitol (Cryptic)-permease IIA component) (Phosphotransferase enzyme II, A component) (EC 2.7.1.69)
1390 :     fig|83333.1.peg.3083 PTS system, N-acetylgalactosamine-specific IIB component 1 (EIIB-AGA) (N-acetylgalactosamine-permease IIB component 1) (Phosphotransferase enzyme II, B component 1) (EC 2.7.1.69)
1391 :     fig|83333.1.peg.3147 Nitrogen regulatory IIA protein (EC 2.7.1.69)
1392 :     fig|83333.1.peg.3534 PTS system, mannitol-specific IIABC component (EIIABC-MTL) (Mannitol- permease IIABC component) (Phosphotransferase enzyme II, ABC component) (EC 2.7.1.69)
1393 :     fig|83333.1.peg.3659 PTS system, beta-glucoside-specific IIABC component (EIIABC-Bgl) (Beta-glucoside-permease IIABC component) (Phosphotransferase enzyme II, ABC component) (EC 2.7.1.69)
1394 :     fig|83333.1.peg.3820 PTS system, fructose-specific IIBC component (EC 2.7.1.69)
1395 :     fig|83333.1.peg.3821 PTS system, fructose-specific IIABC component (EC 2.7.1.69)
1396 :     fig|83333.1.peg.3869 PTS system, fructose-specific IIBC component (EC 2.7.1.69)
1397 :     fig|83333.1.peg.3870 PTS system, fructose-like-2 IIB component 1 (Phosphotransferase enzyme II, B component) (EC 2.7.1.69)
1398 :     fig|83333.1.peg.3873 PTS system, fructose-like-2 IIB component 2 (Phosphotransferase enzyme II, B component) (EC 2.7.1.69)
1399 :     fig|83333.1.peg.4150 PTS system, trehalose-specific IIBC component (EIIBC-TRE) (Trehalose- permease IIBC component) (Phosphotransferase enzyme II, BC component) (EC 2.7.1.69)
1400 :     fig|83333.1.peg.4213 PTS system, galactitol-specific IIC component (EC 2.7.1.69)
1401 :     fig|83333.1.peg.669 PTS system, glucose-specific IIABC component (EC 2.7.1.69)
1402 :     fig|83333.1.peg.723 PTS system, fructose-specific IIABC component (EC 2.7.1.69)
1403 :     1.1.1.71
1404 :     1.1.1.1
1405 :     fig|83333.1.peg.1228 Aldehyde-alcohol dehydrogenase [Includes: Alcohol dehydrogenase (EC 1.1.1.1) (ADH); Acetaldehyde dehydrogenase [acetylating] (EC 1.2.1.10) (ACDH); Pyruvate-formate-lyase deactivase (PFL deactivase)]
1406 :     fig|83333.1.peg.1301 Alcohol dehydrogenase (EC 1.1.1.1)
1407 :     fig|83333.1.peg.3523 Alcohol dehydrogenase (EC 1.1.1.1)
1408 :     fig|83333.1.peg.353 Alcohol dehydrogenase class III (EC 1.1.1.1)
1409 :     2.7.1.63
1410 :     3.2.1.86
1411 :     fig|83333.1.peg.1717 6-phospho-beta-glucosidase (EC 3.2.1.86)
1412 :     fig|83333.1.peg.2853 6-phospho-beta-glucosidase bglA (EC 3.2.1.86)
1413 :     fig|83333.1.peg.3658 6-phospho-beta-glucosidase bglB (EC 3.2.1.86)
1414 :     3.1.6.3
1415 :     1.2.1.12
1416 :     fig|83333.1.peg.1762 Glyceraldehyde 3-phosphate dehydrogenase (EC 1.2.1.12)
1417 :     2.7.2.3
1418 :     fig|83333.1.peg.2877 Phosphoglycerate kinase (EC 2.7.2.3)
1419 :     5.3.1.9
1420 :     fig|83333.1.peg.3934 Glucose-6-phosphate isomerase (EC 5.3.1.9)
1421 :     1.2.1.5
1422 :     1.2.1.51
1423 :     4.2.1.11
1424 :     fig|83333.1.peg.2735 Enolase (EC 4.2.1.11)
1425 :     5.3.1.1
1426 :     fig|83333.1.peg.3838 Triosephosphate isomerase (EC 5.3.1.1)
1427 :     3.1.3.13
1428 :     4.1.2.13
1429 :     fig|83333.1.peg.1504 Fructose-bisphosphate aldolase (EC 4.1.2.13)
1430 :     fig|83333.1.peg.2072 Fructose-bisphosphate aldolase (EC 4.1.2.13)
1431 :     fig|83333.1.peg.2876 Fructose-bisphosphate aldolase (EC 4.1.2.13)
1432 :     2.7.1.2
1433 :     fig|83333.1.peg.2361 Glucokinase (EC 2.7.1.2)
1434 :     2.3.1.12
1435 :     fig|83333.1.peg.115 Dihydrolipoamide acetyltransferase component of pyruvate dehydrogenase complex (EC 2.3.1.12) (E2)
1436 :     1.1.1.2
1437 :     3.1.3.11
1438 :     fig|83333.1.peg.2881 Fructose-1,6-bisphosphatase (EC 3.1.3.11)
1439 :     fig|83333.1.peg.4142 Fructose-1,6-bisphosphatase (EC 3.1.3.11)
1440 :     5.4.2.2
1441 :     fig|83333.1.peg.678 Phosphoglucomutase (EC 5.4.2.2)
1442 :     </pre>
1443 :    
1444 :     Now let's examine the short program that produced the output:
1445 :     <hr>
1446 :     <pre>
1447 :     use FIG;
1448 :     my $fig = new FIG;
1449 :     use Sim;
1450 :    
1451 :     use strict;
1452 :     my($usage,$map,$genome,$name,$role,$peg,$func,$expanded_role);
1453 :    
1454 :     $usage = "usage: connect_map_to_genes Map Genome";
1455 :    
1456 :     (
1457 :     ($map = shift @ARGV) &&
1458 :     ($genome = shift @ARGV)
1459 :     )
1460 :     || die $usage;
1461 :    
1462 :     $name = $fig->map_name($map);
1463 :     print "$map\t$name\n";
1464 :     foreach $role ($fig->map_to_roles($map))
1465 :     {
1466 :     print " $role\n";
1467 :     foreach $peg ($fig->seqs_with_role($role,"master",$genome))
1468 :     {
1469 :     $func = $fig->function_of($peg);
1470 :     print " $peg\t$func\n";
1471 :     }
1472 :     }
1473 :     </pre>
1474 :    
1475 :     <h3>Locating genes within an organism that connect to a specific map</h3>
1476 :     <hr>
1477 :     There arev a number of details worth noting in this little program:
1478 :     <ul>
1479 :     <li> the invocation
1480 :     <pre>
1481 :     $fig->map_name($map)
1482 :     </pre>
1483 :     returns the full name associated with the map.
1484 :     <li>You can get the the list of functional roles contained within a map using
1485 :     <pre>
1486 :     $fig->map_to_roles($map)
1487 :     </pre>
1488 :     <li>Because the roles within a map are often just EC numbers, you will
1489 :     usually wish to "expand" them by attaching a common name for the
1490 :     function represented by the EC number. You can do this using
1491 :     <pre>
1492 :     $expanded_role = $fig->expand_ec($role)
1493 :     </pre>
1494 :     I did not do that in this example, since I wanted you to see the EC
1495 :     numbers being used as functional roles in maps. You should modify the
1496 :     program to expand the ECs and rerun it.
1497 :     <li>
1498 :     Finally, you can get a list of the PEGs within a genome that have gene functions
1499 :     which connect to a specific functional role using
1500 :     <pre>
1501 :     $fig->seqs_with_role($role,"master",$genome)
1502 :     </pre>
1503 :     </ul>
1504 :     Since maps and subsystems are remarkably similar notions (both are
1505 :     sets of functional roles), you could also construct a similar program
1506 :     for subsystems. In the SEED, subsystems do not have IDs which are
1507 :     associated with full names; rather, the full name is itself the ID of
1508 :     the subsystem. With this in mind, the following program should be
1509 :     reasonably clear:
1510 :    
1511 :     <hr>
1512 :     <pre>
1513 :     use FIG;
1514 :     my $fig = new FIG;
1515 :     use Sim;
1516 :    
1517 :     use strict;
1518 :     my($usage,$subsystem,$genome,$name,$role,$peg,$func);
1519 :    
1520 :     $usage = "usage: connect_subsystem_to_genes Subsystem Genome";
1521 :    
1522 :     (
1523 :     ($subsystem = shift @ARGV) &&
1524 :     ($genome = shift @ARGV)
1525 :     )
1526 :     || die $usage;
1527 :    
1528 :     print "$subsystem\n";
1529 :     foreach $role ($fig->subsystem_to_roles($subsystem))
1530 :     {
1531 :     print " $role\n";
1532 :     foreach $peg ($fig->seqs_with_role($role,"master",$genome))
1533 :     {
1534 :     $func = $fig->function_of($peg);
1535 :     print " $peg\t$func\n";
1536 :     }
1537 :     }
1538 :     </pre>
1539 :    
1540 :     <h3>Locating genes within an organism that connect to a specific subsystem</h3>
1541 :     <hr>
1542 :     Invoking this program using
1543 :     <pre>
1544 :     connect_subsystem_to_genes 'Embden-Meyerhof and Gluconogenesis' 83333.1
1545 :     </pre>
1546 :     would produce the following output:
1547 :     <pre>
1548 :     </pre>
1549 :     Note that the functional roles in subsystems tend to be far more
1550 :     detailed (and almost never just an EC number).
1551 :     <p>
1552 :     The problem with the above program is that we have chosen to view a
1553 :     subsystem as a spreadsheet in which each cell corresponds to a
1554 :     specific functional role in a designated organism. The gene functions
1555 :     of the PEGs contained in the cell usually connect to the specific
1556 :     role, but this is not forced. If someone changes the function of a
1557 :     gene, it does not disappear from the spreadsheet, unless the
1558 :     spreadsheet is rebuilt by forcing the entries to reflect current
1559 :     assignments and how they connect to the roles. We have chosen this
1560 :     approach to allow curators of spreadsheets to have precise control
1561 :     over the contents. People who import a subsystem may or may not
1562 :     instal the gene assignments needed to make the spreadsheet
1563 :     consistent, but the spreadsheet reflects the curators best assessment
1564 :     of which genes implement specific roles. This leads to my assertion
1565 :     that the following program is a better way to display a subsystem:
1566 :    
1567 :     <hr>
1568 :     <pre>
1569 :     use FIG;
1570 :     my $fig = new FIG;
1571 :     use Sim;
1572 :    
1573 :     use strict;
1574 :     my($usage,$subsystem,$genome,$name,$role,$peg,$func);
1575 :    
1576 :     $usage = "usage: connect_subsystem_to_genes_the_right_way Subsystem Genome";
1577 :    
1578 :     (
1579 :     ($subsystem = shift @ARGV) &&
1580 :     ($genome = shift @ARGV)
1581 :     )
1582 :     || die $usage;
1583 :    
1584 :     print "$subsystem\n";
1585 :     foreach $role ($fig->subsystem_to_roles($subsystem))
1586 :     {
1587 :     print " $role\n";
1588 :     foreach $peg ($fig->pegs_in_subsystem_cell($genome,$role))
1589 :     {
1590 :     $func = $fig->function_of($peg);
1591 :     print " $peg\t$func\n";
1592 :     }
1593 :     }
1594 :     </pre>
1595 :    
1596 :     <h3>Locating genes within an organism that connect to a specific
1597 :     subsystem the right way</h3>
1598 :     <hr>
1599 :    
1600 :     <p>
1601 :     We have just illustrated how to connect maps to functional roles and
1602 :     functional roles to gene functions. Now we need to go the other way.
1603 :     That is, given a gene function, what roles would it
1604 :     connect to, and which maps and subsystems contain those roles?
1605 :     The following program illustrates the services required to answer this
1606 :     question;
1607 :    
1608 :     <hr>
1609 :     <pre>
1610 :     use FIG;
1611 :     my $fig = new FIG;
1612 :    
1613 :     use strict;
1614 :     my($usage,$peg,$func,@roles,$role,@maps,$map);
1615 :     my(%in_map,@maps_it_connects_to,@subsystems_it_is_in);
1616 :     my($map_name,$subsystem);
1617 :    
1618 :     $usage = "usage: gf2maps_and_subs PEG";
1619 :    
1620 :     # The single input argument must be a FIG id, which is used to get the assigned function.
1621 :     # The program then prints a list of maps to which the gene can be connected,
1622 :     # followed by a list of subsystems that contain the gene.
1623 :    
1624 :     ($peg = shift @ARGV)
1625 :     || die $usage;
1626 :    
1627 :     if (($func = $fig->function_of($peg)) && (! &FIG::hypo($func)))
1628 :     {
1629 :     @roles = $fig->roles_of_function($func);
1630 :     foreach $role (@roles)
1631 :     {
1632 :     @maps = $fig->role_to_maps($role);
1633 :     foreach $map (@maps)
1634 :     {
1635 :     $in_map{$map} = 1;
1636 :     }
1637 :     }
1638 :    
1639 :     @maps_it_connects_to = sort keys(%in_maps);
1640 :    
1641 :     if (@maps_it_connects_to > 0)
1642 :     {
1643 :     print "Maps that connect to $func\n\n";
1644 :     foreach $map (@maps_it_connects_to)
1645 :     {
1646 :     $map_name = $fig->map_name($map);
1647 :     print "$map\t$map_name\n";
1648 :     }
1649 :     print "\n";
1650 :     }
1651 :    
1652 :     }
1653 :     else
1654 :     {
1655 :     print STDERR "you need to give a valid PEG with a non-hypothetical function to connect to maps\n";
1656 :     }
1657 :    
1658 :     @subsystems_it_is_in = $fig->peg_to_subsystems($peg);
1659 :    
1660 :     if (@subsystems_it_is_in > 0)
1661 :     {
1662 :     print "Subsystems that contain $peg\n\n";
1663 :     foreach $subsystem (@subsystems_it_is_in)
1664 :     {
1665 :     print "$subsystem\n";
1666 :     }
1667 :     }
1668 :     </pre>
1669 :     <h3>Going from a PEG to the maps and subsystems containing it</h3>
1670 :     <hr>
1671 :     Here, again, there are some of points worth discussing:
1672 :     <ul>
1673 :     <li> First, note that the input argument is a PEG, which we use to get
1674 :     the function that we will work with. You learned how to get the
1675 :     function in a previous example, but note the use of
1676 :     <pre>
1677 :     &FIG::hypo($func)
1678 :     </pre>
1679 :     to check to see whether or not the function is equivalent to
1680 :     <i>hypothetical protein</i>.
1681 :     <p>
1682 :     The way we explore connections between the PEG and maps is quite
1683 :     different than the way we connect the PEG to subsystems. In the case
1684 :     of maps, the connection is constructed by computing the set of roles
1685 :     to which the gene function connects, and then connecting each role to
1686 :     maps. In the case of subsystems, we simply look up which subsystems
1687 :     contain the given PEG. Subsystems explicitly contain a precise list
1688 :     of the PEGs that occur within them, which is not true of maps.
1689 :    
1690 :     <li>The key services for getting at the connections are as follows:
1691 :     <pre>
1692 :     @roles = $fig->roles_of_function($func);
1693 :     @maps = $fig->role_to_maps($role)
1694 :     @subsystems = $fig->peg_to_subsystems($peg);
1695 :     </pre>
1696 :     The first takes a function as input and returns a list of roles that
1697 :     the function connects to, the second takes a role and returns a list
1698 :     of maps to which the role connects, and the last takes a peg and
1699 :     returns the list of subsystems that contain the PEG.
1700 :     </ul>
1701 :     If you take a few minutes out and think about maps and subsystems, you
1702 :     will see that they are very, very similar concepts that are treated
1703 :     quite differently by the SEED.
1704 :    
1705 :     <h2>Annotations</h2>
1706 :    
1707 :     Annotations are time-stamped text notes made by a named user and
1708 :     attached to specific features. The most common are notes recording
1709 :     when someone made an assignment of function to a PEG, but they can be
1710 :     used to record anything. The usual way to access the annotations
1711 :     attached to a given feature with an ID designated by <i>$fid</i> is to
1712 :     use
1713 :     <pre>
1714 :     @annotations = $fig->feature_annotations($fid)
1715 :     </pre>
1716 :     The returned list contains 4-tuples (i.e., pointers to lists, each of
1717 :     which contains 4 entries). Each 4-tuple contains
1718 :     <ol>
1719 :     <li> the FID (this is redundant, but it was convenient for reasons
1720 :     that will become apparent),
1721 :     <li>a timestamp,
1722 :     <li>the name of the user that made the annotation, and
1723 :     <li>the annotation in the form of a text string.
1724 :     </ol>
1725 :     The following short program uses
1726 :     <pre>
1727 :     @annotations = $fig->merged_related_annotations($fig_ids)
1728 :     </pre>
1729 :     which is a useful generalization. This routine takes a pointer to a list of
1730 :     feature IDs as input, and it produces a merged list of the annotations
1731 :     attached to all of the features. This can often be very convenient.
1732 :     For example, you might have a set of similar PEGs that are all
1733 :     misannotated, and you wish to reconstruct the steps that led to the
1734 :     current situation. By merging the annotations (which normally record
1735 :     both who made the annotations and why), you can get a record of the
1736 :     events in chronilogical order. The following short example
1737 :     illustrates this utility:
1738 :    
1739 :     <hr>
1740 :     <pre>
1741 :     use FIG;
1742 :     my $fig = new FIG;
1743 :     use strict;
1744 :    
1745 :     my($usage,$fig_ids,@annotations);
1746 :    
1747 :     $usage = "usage: annotations_of Fid1 Fid2 ...";
1748 :    
1749 :     (@ARGV > 0)
1750 :     || die $usage;
1751 :    
1752 :     $fig_ids = [@ARGV];
1753 :    
1754 :     @annotations = $fig->merged_related_annotations($fig_ids);
1755 :     if (@annotations > 0)
1756 :     {
1757 :     &display_annotations(\@annotations);
1758 :     }
1759 :     else
1760 :     {
1761 :     print STDERR "Sorry, there were no annotations attached to the features\n";
1762 :     }
1763 :    
1764 :     sub display_annotations {
1765 :     my($annotations) = @_;
1766 :     my($annotation,$fid,$timestamp,$user,$text);
1767 :    
1768 :     foreach $annotation (@$annotations)
1769 :     {
1770 :     ($fid,$timestamp,$user,$text) = @$annotation;
1771 :     print "=======\n";
1772 :     print "$user\t$timestamp\t$fid\n$text\n";
1773 :     print "=======\n\n";
1774 :     }
1775 :     }
1776 :    
1777 :     </pre>
1778 :     <h3>Accessing the annotations from a set of features</h3>
1779 :     <hr>
1780 :     Invoking this little utility using
1781 :     <pre>
1782 :     annotations_of 'fig|207559.1.peg.735' 'fig|12149.1.peg.2133' 'fig|1525.1.peg.1757'
1783 :     </pre>
1784 :     produced the following output:
1785 :     <pre>
1786 :     =======
1787 :     automated_assignment Thu Mar 18 17:42:51 2004 fig|12149.1.peg.2133
1788 :     Set master function to
1789 :     ATP phosphoribosyltransferase (EC 2.4.2.1
1790 :     =======
1791 :    
1792 :     =======
1793 :     automated_assignment Thu Mar 18 17:45:16 2004 fig|1525.1.peg.1757
1794 :     Set master function to
1795 :     ATP phosphoribosyltransferase (EC 2.4.2.1
1796 :     =======
1797 :    
1798 :     =======
1799 :     last_release Fri Apr 2 14:14:07 2004 fig|207559.1.peg.735
1800 :     Set master function to
1801 :     ATP phosphoribosyltransferase (EC 2.4.2.1
1802 :     =======
1803 :     </pre>
1804 :    
1805 :     <h2>Assigning Gene Functions</h2>
1806 :    
1807 :     We can now easily talk about assigning functions and making
1808 :     annotations to record the events. Here is a simple utility
1809 :     that can be used to make annotations:
1810 :    
1811 :     <hr>
1812 :     <pre>
1813 :     use FIG;
1814 :     my $fig = new FIG;
1815 :    
1816 :     use strict;
1817 :     my($usage,$user,$annotation,$text,$peg,$func,$conf,$user_name,$funcO);
1818 :    
1819 :     $usage = "usage: batch_assignments User [AnnotationFile] < Assignments";
1820 :    
1821 :     ($user = shift @ARGV)
1822 :     || die $usage;
1823 :    
1824 :     if ($annotation = shift @ARGV)
1825 :     {
1826 :     $text = join("",`cat $annotation`);
1827 :     }
1828 :     else
1829 :     {
1830 :     $text = "";
1831 :     }
1832 :    
1833 :     while (defined($_ = <STDIN>))
1834 :     {
1835 :     chop;
1836 :     ($peg,$func,$conf) = split(/\t/,$_);
1837 :     if (! $conf) { $conf = "" }
1838 :     if ($user =~ /master:(.*)/)
1839 :     {
1840 :     $user_name = $1;
1841 :     $funcO = $fig->function_of($peg);
1842 :     if ($funcO ne $func)
1843 :     {
1844 :     $fig->assign_function($peg,"master",$func,$conf);
1845 :     $fig->add_annotation($peg,$user_name,"Set master function to\n$func\n$text");
1846 :     }
1847 :     }
1848 :     else
1849 :     {
1850 :     $funcO = $fig->function_of($peg,$user);
1851 :     if ($funcO ne $func)
1852 :     {
1853 :     $fig->assign_function($peg,$user,$func,$conf);
1854 :     $fig->add_annotation($peg,$user,"Set function to\n$func\n$text");
1855 :     }
1856 :     }
1857 :     }
1858 :    
1859 :    
1860 :     </pre>
1861 :     <h3>A utility to make batch assignments</h3>
1862 :     <hr>
1863 :     Before discussing the details of this code, we should discuss the
1864 :     issue of "master and non-master users". Each gene can have a single
1865 :     <i>master assignment</i> and any number of <i>non-master
1866 :     assignments</i>. A user who has signed in as <i>master:USER</i>
1867 :     (e.g., "master:RossO") is a master user. If a master user makes an
1868 :     assignment, it overwrites the master assignment. If a non-master user
1869 :     makes an assignment, it alters only the assignment corresponding to
1870 :     that user, if there is one. When a master asks for the function of a
1871 :     gene, he gets the master assignment. If a non-master asks for the
1872 :     function of a gene, he will get the assignment corresponding to his
1873 :     user id, if there is one; if not, he will get the master assignment,
1874 :     if there is one.
1875 :     <p>
1876 :     There is also the matter of <i>confidence levels</i>. The SEED
1877 :     supports attaching confidence codes to assignments, although they are
1878 :     seldom used. The codes we use at this point are:
1879 :     <ul>
1880 :     <li> an empty string, a space or <b>N</B> means "normal or average confidence",
1881 :     <li> an <b>H</b> or an <b>S</B> means "high or strong confidence", and
1882 :     <li>an <b>L</b> or a <b>W</b> means "low or weak confidence".
1883 :     </ol>
1884 :     We are not really employing the codes actively. The major source of
1885 :     strength in an assignment is being part of a subsystem or a
1886 :     well-characterized class of proteins. This will clarify substantially
1887 :     as the Project to Annotate 1000 Genomes progresses.
1888 :     <p>
1889 :     Now, back to the example program:
1890 :     <ul>
1891 :     <li>
1892 :     The program is written to allow the
1893 :     user to optionally attach a block of annotation to each of the genes
1894 :     that get reassigned. If the optional argument (the second argument)
1895 :     is prtesent, it is presumed to be the name of a file containing text
1896 :     to be appended to the annotation the program makes to record the
1897 :     change of assignment. The line of code
1898 :     <pre>
1899 :     $text = join("",`cat $annotation`);
1900 :     </pre>
1901 :     just reads the entire file, concatenates the lines, and assigns the
1902 :     result to the variable <i>$text</i>.
1903 :     <li>
1904 :     We default the <i>confidence level</i> to the empty string.
1905 :     <li> We only make asssignments if the current function differs from
1906 :     the one to be assigned. <i>$funcO</i> is used to hold the existing
1907 :     asssignment. Note that
1908 :     <pre>
1909 :     $funcO = $fig->function_of($peg,$user);
1910 :     </pre>
1911 :     uses a form of <i>$fig->function_of</i> in which a second argument is
1912 :     used to indicate a non-master user.
1913 :     <li>
1914 :     The
1915 :     <pre>
1916 :     $fig->assign_function($peg,$user,$func,$conf);
1917 :     </pre>
1918 :     is used to actually make the assignment.
1919 :     <li>We record the assignment by adding an annotation (possibly with
1920 :     the text from the optional file appended) using
1921 :     <pre>
1922 :     $fig->add_annotation($peg,$user,"Set function to\n$func\n$text");
1923 :     </pre>
1924 :     </ul>
1925 :    
1926 :     <h2>BBHs</h2>
1927 :    
1928 :     Now, let us return briefly to the topic of similarities. The notion
1929 :     of <i>bi-directional best hit</i> (<b>BBH</b>) often pops up in
1930 :     discussions relating to similarities and how to project assignments
1931 :     from a gene in one genome to a gene in another. A gene <i>X</i> in
1932 :     genome <i>Gx</i> is a BBH of a gene <i>Y</i> in <i>Gy</i> iff
1933 :     <ol>
1934 :     <li><i>X</i> is similar to <i>Y</i>,
1935 :     <i>there is no other gene in <i>Gy</i> more similar to <i>X</i>, and
1936 :     <i>there is no other gene in <i>Gx</i> more similar to <i>Y</i>.
1937 :     </ol>
1938 :     To get the BBHs for a given PEG, you would use the something like the
1939 :     following:
1940 :     <pre>
1941 :     @bbhs = $fig->bbhs($peg,1.0e-10);
1942 :     </pre>
1943 :     This would set <i>@bbhs</i> to a list of 2-tuples. Each 2-tuple would
1944 :     contain a PEG and a p-score. They would correspond to BBHs which have
1945 :     similarity better than 1.0e-10.
1946 :    
1947 :     <h2>The Use of dsims</h2>
1948 :    
1949 :     We have constructed a means of computing <i>dynamic similarities</i>
1950 :     fairly rapidly. The basic idea involves breaking the nonredundant
1951 :     protein database into many thousands of subsets, where all of the
1952 :     sequences in a subset are quite similar (sequences that do not fall
1953 :     into subsets get placed into a <i>nohits database</i>). Then, one or more
1954 :     representative sequences from each subset are included in a database
1955 :     of <i>exemplars</i>.
1956 :     <p>
1957 :     To get <i>blastp</i> similarities quickly (and somewhat less
1958 :     accurately than just blasting against the nonredundant database), use
1959 :     something like
1960 :     <pre>
1961 :     @sims = $fig->dsims($id,$seq,200,1.0e-10,"raw")
1962 :     </pre>
1963 :     This would return the similarities that can rapidly be detected
1964 :     between the protein sequence with id given by <i>id</i> and sequence
1965 :     given by <i>$seq</i>. It would give you at most 200 similarities
1966 :     with p-scores better than 1.0e-10, and you would get only the
1967 :     <i>raw</i> similarities (<i>all</i> would return the expanded
1968 :     similarities).
1969 :    
1970 :     <h1> PAUSE: I am pausing here; I hope to finish this tutorial within a few months</h1>
1971 :    
1972 :     <h2>Clusters, Pins, and Largest Clusters</h2>
1973 :    
1974 :     <h2>The Indexes</h2>
1975 :    
1976 :     <h2>Administrative Issues</h2>
1977 :    
1978 :     <h2>Atomated Assignment of Function</h2>
1979 :    
1980 :     <h2>Finding Candidates for a Functional Role</h2>
1981 :    
1982 :     <h2>Protein Families</h2>
1983 :    
1984 :     <h2>Compounds and Reactions</h2>
1985 :    
1986 :     <h2>Subsystems</h2>
1987 :    
1988 :     <h2>Links to PEGs</h2>
1989 :    
1990 :     <h2>Miscellaneous Stuff that You Should Know</h2>

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3