Revision 1.3, Thu Jun 14 14:06:41 2012 UTC, by disz

<h1>Extracting Data from the CS Using the CS-API: Some Typical Examples</h1>

<h2>Introduction</h2>
We will cover the basic tools for accessing data in the CS via the CS-API (as opposed to
the more common use of the command-line tools).  We will give small test programs along
with displayed output in hopes that users can easily generate what they need by making minor modifications.<br><br><br><br><div style="text-align: center;"><big style="font-weight: bold;"><big>Table of Contents</big></big><br></div><ul><li><a href="#mozTocId403269"> Extracting Data Corresponding to the Genome Entity</a></li><li><a href="#mozTocId789578"> Extracting Data Corresponding to the Feature Entity</a></li><li><a href="#mozTocId477963">Looking Up the Functions Assigned to Fids</a></li><li><a href="#mozTocId279369"> Getting the DNA Sequence for  Feature Entities</a></li><li><a href="#mozTocId231552">Extracting the Protein Families that Contain One or More Features</a></li><li><a href="#mozTocId439283">Getting Literature Related to a Given Feature</a></li><li><a href="#mozTocId787030">Accessing the Annotations Associated with One or More Features</a></li><li><a href="#mozTocId948659">Locating Features that Tend to Co-occur with a Given Feature</a></li><li><a href="#mozTocId611557">Locating Subsystems that Contain a Given Feature</a></li><li><a href="#mozTocId182481">What Atomic Regulons Contain a fid (if any)?</a></li><li><a href="#mozTocId300685">Getting the Pearson Correlation Coefficient for Apparently Coexpressed Features</a></li><li><a href="#mozTocId57110">Finding All Features with the Same Protein Sequence</a></li><li><a href="#mozTocId586120">Accessing Members of a Protein Family and the Family Functions</a></li><li><a href="#mozTocId458855">Getting a Set of Co-occurring Protein Families</a></li><li><a href="#mozTocId193709">Given a Role, which Subsystems Include it?</a></li><li><a href="#mozTocId869371">Going from Roles to Complexes to Reactions to Printable Versions of Reactions</a></li></ul><br>

<h3><a class="mozTocH2" name="mozTocId403269"></a> Extracting Data Corresponding to the <b>Genome</b> Entity</h3>
You can access data relating to several genomes with a single invocation of <i>genomes_to_genome_data</i>,
but most of the time you will be after data relating to a single genome (in the example, <i>kb|g.0</i>).
<br><pre>use strict;<br>use Data::Dumper;<br>use Bio::KBase::CDMI::CDMIClient;<br>use Bio::KBase::Utilities::ScriptThing;<br>my $csO = Bio::KBase::CDMI::CDMIClient-&gt;new_for_script();<br><br>my $gH = $csO-&gt;genomes_to_genome_data(['kb|g.0']);<br>print &amp;Dumper($gH)<br><br></pre>
produces the following output:
<br><pre>$VAR1 = {<br>          'kb|g.0' =&gt; {<br>                        'rnas' =&gt; '170',<br>                        'gc_content' =&gt; '50.7888716661698',<br>                        'dna_size' =&gt; '4639221',<br>                        'taxonomy' =&gt; 'Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales; Enterobacteriaceae; Escherichia; Escherichia coli K-12',<br>                        'scientific_name' =&gt; 'Escherichia coli K12',<br>                        'contigs' =&gt; '1',<br>                        'genome_md5' =&gt; 'b4ff0a0fea9686b26b4e2a3cf7b6adbf',<br>                        'pegs' =&gt; '4308',<br>                        'genetic_code' =&gt; '11',<br>                        'complete' =&gt; '1'<br>                      }<br>        };<br><br></pre>

You might wish to determine which subsystems include a given genome.  To get that, you should use something like
<br><pre>use strict;<br>use Data::Dumper;<br>use Bio::KBase::CDMI::CDMIClient;<br>use Bio::KBase::Utilities::ScriptThing;<br>my $csO = Bio::KBase::CDMI::CDMIClient-&gt;new_for_script();<br><br>my $genomes = ['kb|g.0'];<br>my $gH     = $csO-&gt;genomes_to_subsystems($genomes);<br>my $dataH  = {};<br>foreach my $g (@$genomes)<br>{<br>    my $x  = $gH-&gt;{$g};<br>    if (! $x)<br>    {<br>	$dataH-&gt;{$g} = [];<br>    }<br>    else<br>    {<br>	# we reorder the output and throw out entries with variants that are not active<br>	my @pairs = map { [$_-&gt;[1],$_-&gt;[0]] } grep { $_-&gt;[0] !~ /^\*?(0|-1)$/ } @$x;<br>	$dataH-&gt;{$g} = \@pairs;<br>    }<br>}<br><br>print &amp;Dumper($dataH);<br><br></pre>
The data passed back includes subsystems in which the genome has been placed
with a variant code indicating that it is not active.  You probably will wish to
exclude these, and the snippet of code shows you how to do it.
Running the code should produce something like
<br><pre>$VAR1 = {<br>          'kb|g.0' =&gt; [<br>                        [<br>                          'Restriction-Modification System',<br>                          '1.0'<br>                        ],<br>                        [<br>                          'Citrate Metabolism, Transport, and Regulation',<br>                          '1'<br>                        ],<br>                        [<br>                          'The usher protein HtrE fimbrial cluster',<br>                          '1'<br>                        ],<br>                        [<br>                          'Methylglyoxal Metabolism',<br>                          '3.0'<br>                        ],<br>                        [<br>                          'Synthesis of osmoregulated periplasmic glucans',<br>                          '1.3'<br>                        ],<br>                        [<br>                          'At5g37530 (CsdL protein family)',<br>                          '1'<br>                        ],<br>                        [<br>                          'Housecleaning nucleoside triphosphate pyrophosphatases',<br>                          '1'<br>                        ],<br>                        [<br>                          'A Hypothetical Protein Related to Proline Metabolism',<br>                          '1'<br>                        ],<br>                        [<br>                          'Rcs phosphorelay signal transduction pathway',<br>                          '1'<br>                        ],<br>.<br>.<br>.<br>                      ]<br>        };<br><br><br></pre>
The 2-tuples are [subsystem-name,variant-code].
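<br><br>The variant-code filter can be tried in isolation.  The following is a minimal sketch, not part of the CS-API: the helper name <i>is_active_variant</i> and the sample codes are hypothetical, and only the regular expression is taken from the program above.
<br><pre>use strict;
use warnings;

# Hypothetical sample of variant codes; only the pattern comes from the example above.
my @codes = ('1.0', '1', '0', '-1', '*0', '*-1', '3.0');

# A code is treated as inactive when it is 0 or -1, optionally preceded by '*'.
sub is_active_variant {
    my($code) = @_;
    return $code !~ /^\*?(0|-1)$/;
}

my @active = grep { is_active_variant($_) } @codes;
print join(",", @active), "\n";
</pre>
With these sample codes, only <i>1.0</i>, <i>1</i> and <i>3.0</i> pass the filter.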
<p>

</p><h3><a class="mozTocH2" name="mozTocId789578"></a> Extracting Data Corresponding to the <b>Feature</b> Entity</h3>
You can get data for one or more <i>Features</i> using something like
<br><pre>use strict;<br>use Data::Dumper;<br>use Bio::KBase::CDMI::CDMIClient;<br>use Bio::KBase::Utilities::ScriptThing;<br>my $csO = Bio::KBase::CDMI::CDMIClient-&gt;new_for_script();<br><br>my $dataH = &amp;fids_to_feature_data($csO,['kb|g.0.peg.2','kb|g.0.peg.3']);<br>print &amp;Dumper($dataH);<br><br>sub fids_to_feature_data {<br>    my($csO,$fids) = @_;<br><br>    my $dataH = $csO-&gt;fids_to_feature_data($fids);<br>    foreach my $fid (keys(%$dataH))<br>    {<br>	$dataH-&gt;{$fid}-&gt;{feature_location} = &amp;loc_to_locstring($dataH-&gt;{$fid}-&gt;{feature_location});<br>    }<br>    return $dataH;<br>}<br><br>sub loc_to_locstring {<br>    my($loc) = @_;<br><br>    return join(",",map { "$_-&gt;[0]\_$_-&gt;[1]$_-&gt;[2]$_-&gt;[3]" } @$loc);<br>}<br><br><br></pre>
which produces the following output:
<br><pre>$VAR1 = {<br>          'kb|g.0.peg.3' =&gt; {<br>                              'feature_id' =&gt; 'kb|g.0.peg.3',<br>                              'feature_publications' =&gt; [],<br>                              'feature_length' =&gt; '1986',<br>                              'feature_location' =&gt; 'kb|g.0.c.1_1342781-1986',<br>                              'genome_name' =&gt; 'Escherichia coli K12',<br>                              'feature_function' =&gt; 'hypothetical protein'<br>                            },<br>          'kb|g.0.peg.2' =&gt; {<br>                              'feature_id' =&gt; 'kb|g.0.peg.2',<br>                              'feature_publications' =&gt; [<br>                                                          [<br>                                                            '11527384',<br>                                                            'http://www.ncbi.nlm.nih.gov/pubmed/11527384',<br>                                                            'CueO is a multi-copper oxidase that confers copper tolerance in Escherichia coli.'<br>                                                          ],<br>                                                          [<br>                                                            '11867755',<br>                                                            'http://www.ncbi.nlm.nih.gov/pubmed/11867755',<br>                                                            'Crystal structure and electron transfer kinetics of CueO, a multicopper oxidase required for copper homeostasis in Escherichia coli.'<br>                                                          ]<br>                                                        ],<br>                              'feature_length' =&gt; '1551',<br>                              'feature_location' =&gt; 'kb|g.0.c.1_137083+1551',<br>                              'genome_name' =&gt; 'Escherichia coli K12',<br>                              'feature_function' =&gt; 'Blue copper oxidase CueO precursor'<br>                            }<br>        };<br><br></pre>
Note that we converted the <i>location</i> from a list of regions to a printable string.
You may well wish to retain it in its list format, if you intend to work with it.
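<br><br>As an illustration of working with the list format, here is a minimal sketch that computes the last base of each region.  The regions are reconstructed from the location strings printed above, in the order [contig,begin,strand,length] used by <i>loc_to_locstring</i>.  The helper <i>region_end</i> is hypothetical, and it assumes that <i>begin</i> is the first base in the direction of transcription; check that convention against your data before relying on it.
<br><pre>use strict;
use warnings;

# Regions reconstructed from the location strings above: [contig, begin, strand, length].
my $loc = [
    ['kb|g.0.c.1', 137083,  '+', 1551],
    ['kb|g.0.c.1', 1342781, '-', 1986],
];

# Assumption: begin is the first base in the direction of transcription,
# so the last base lies length-1 positions further along the strand.
sub region_end {
    my($region) = @_;
    my(undef, $begin, $strand, $len) = @$region;
    return ($strand eq '+') ? $begin + $len - 1 : $begin - $len + 1;
}

foreach my $region (@$loc) {
    print join("\t", $region-&gt;[0], $region-&gt;[1], region_end($region)), "\n";
}
</pre>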

<h3><a class="mozTocH2" name="mozTocId477963"></a>Looking Up the Functions Assigned to Fids</h3>
To find out the current functions assigned to one or more fids, use something like
<br><pre>use strict;<br>use Data::Dumper;<br>use Bio::KBase::CDMI::CDMIClient;<br>use Bio::KBase::Utilities::ScriptThing;<br>my $csO = Bio::KBase::CDMI::CDMIClient-&gt;new_for_script();<br><br>my $fids  = ['kb|g.0.peg.2','kb|g.0.peg.4'];<br><br>my $fidH = $csO-&gt;fids_to_functions($fids);<br>print &amp;Dumper($fidH);<br><br></pre>
This produces output similar to
<br><pre>$VAR1 = {<br>          'kb|g.0.peg.2' =&gt; 'Blue copper oxidase CueO precursor',<br>          'kb|g.0.peg.4' =&gt; 'L,D-transpeptidase YcfS'<br>        };<br><br></pre>

<h3><a class="mozTocH2" name="mozTocId279369"></a> Getting the DNA Sequence for  <b>Feature</b> Entities</h3>
The following short piece of code extracts the DNA sequences of two features and reformats
them as FASTA entries.
<br><pre>use strict;<br>use Data::Dumper;<br>use Bio::KBase::CDMI::CDMIClient;<br>use Bio::KBase::Utilities::ScriptThing;<br>use SeedUtils;   # provides create_fasta_record, called in to_fasta below<br>my $csO = Bio::KBase::CDMI::CDMIClient-&gt;new_for_script();<br><br>my $dataH = &amp;ids_to_dna_sequences($csO,['kb|g.0.peg.2','kb|g.0.peg.3']);<br>print &amp;Dumper($dataH);<br><br>sub ids_to_dna_sequences {<br>    my($csO,$fids) = @_;<br><br>    my $to_seqH  = $csO-&gt;fids_to_dna_sequences($fids);<br>    my $dataH  = {};<br>    foreach my $fid (@$fids)<br>    {<br>	my $seq = $to_seqH-&gt;{$fid};<br>	$dataH-&gt;{$fid} = &amp;to_fasta($fid,$seq);<br>    }<br>    return $dataH;<br>}<br><br>sub to_fasta {<br>    my($id,$seq) = @_;<br><br>    return &amp;SeedUtils::create_fasta_record($id,'',$seq);<br>}<br><br><br></pre>
It produces the following output:
<br><pre>$VAR1 = {<br>          'kb|g.0.peg.3' =&gt; '&gt;kb|g.0.peg.3<br>atgaaaaccgttagggagtccacaacgttgtacaactttctcggatcgcacaatccatac<br>tggcggttgacggaaagcagcgatgttttgcgcttttctaccaccgaaaccacagaacct<br>gatcgtacattgcagttatctgccgaacaggctgctcgcatcagggaaatgacggtcatc<br>acctccagcctgatgatgagtctgaccgtcgatgaaagcgatctttctgtgcatctggta<br>ggacgaaaaatcaataaacgggaatgggctggcaacgcgtctgcatggcatgacacaccg<br>gcggttgctcgtgatttatcacacgggctttcctttgctgagcaggtagtttctgaagca<br>cattccgcaatagtgattctcgacagccgggggaatatccaacgcttcaatcggttatgt<br>gaagattacacagggttgaaagaacacgacgtcattgggcaaagcgtgtttaaactgttt<br>atgagccgtcgtgaagctgcggcatccaggcgcaataaccgtgtattttttcgaagcggc<br>aatgcatatgaagtcgaactgtggataccaacatgtaaaggccagcggctgtttctgttt<br>cgcaataaatttgtccacagcggcagtggcaaaaacgagatttttttaatctgttccggc<br>accgacattaccgaagagcgccgcgctcaggagcgactgcgtattctggcaaataccgac<br>agtatcaccggactgccgaatcgtaacgcaatgcaggatttaatcgatcacgctattaat<br>catgcagataacaataaagttggggttgtgtatcttgatttggataatttcaaaaaggtc<br>aacgacgcctatgggcatttgtttggtgaccagttattacgcgacgtgtcattggctatt<br>ttaagctgtctcgaacatgaccaggtgttggcgcgtccaggtggggatgagtttctggta<br>ctggcatccaacacctcacaaagcgcgctggaagcaatggcatcacgaattttgacccgc<br>ttacggctcccctttcgcattggtttaattgaagtttataccagctgttcagtaggtatt<br>gcactctctcccgaacatggttcagacagcacggctattattcgtcacgccgacacagca<br>atgtacacagcgaaggaaggcggacgaggacaattttgcgtttttaccccagaaatgaat<br>caacgggtatttgaatatctctggctggataccaacttgcgtaaagcactggaaaacgat<br>cagttggttattcactatcaaccgaaaatcacctggcgtggcgaagtgcgcagtctggaa<br>gcactagtacgttggcagtcacctgaacgtgggttgattccaccgttggacttcatttcc<br>tacgccgaagagtcagggctaattgtgcctttaggccgttgggtgattctcgatgtcgta<br>cgccaggtggcaaagtggcgggataaaggcataaacctgcgagtggcggtaaatatttct<br>gcacgtcagctcgccgatcaaaccattttcaccgccctgaaacaggttctccaggaactc<br>aattttgaatactgccctatagatgttgaactgacagagagttgtctgattgagaatgat<br>gaactggcactgtctgttattcaacaatttagccaactaggtgcgcaagtgcatctggat<br>gattttggtaccggctactcttcactttcgcaactggcgcgctttccgatcgatgccatc<br>aaacttgaccaggtttttgttcgagatattcacaaacaacctgtctcgcagtcactggtc<br>cgggcg
atcgtcgctgtggcccaggcattgaatcttcaggtgatcgccgaaggtgtagag<br>agtgcaaaggaagatgcttttttaaccaagaacgggatcaatgagcggcaaggatttttg<br>tttgccaaaccgatgcccgccgtcgccttcgaacgctggtataaacgctatctgaagcgc<br>gcataa<br>',<br>          'kb|g.0.peg.2' =&gt; '&gt;kb|g.0.peg.2<br>atgcaacgtcgtgatttcttaaaatattccgtcgcgctgggtgtggcttcggctttgccg<br>ctgtggagccgcgcagtatttgcggcagaacgcccaacgttaccgatccctgatttgctc<br>acgaccgatgcccgtaatcgcattcagttaactattggcgcaggccagtccacctttggc<br>gggaaaactgcaactacctggggctataacggcaatctgctggggccggcggtgaaatta<br>cagcgcggcaaagcggtaacggttgatatctacaaccaactgacggaagagacaacgttg<br>cactggcacgggctggaagtaccgggtgaagtcgacggcggcccgcagggaattattccg<br>ccaggtggcaagcgctcggtgacgttgaacgttgatcaacctgccgctacctgctggttc<br>catccgcatcagcacggcaaaaccgggcgacaggtggcgatggggctggctgggctggtg<br>gtgattgaagatgacgagatcctgaaattaatgctgccaaaacagtggggtatcgatgat<br>gttccggtgatcgttcaggataagaaatttagcgccgacgggcagattgattatcaactg<br>gatgtgatgaccgccgccgtgggctggtttggcgatacgttgctgaccaacggtgcaatc<br>tacccgcaacacgctgccccgcgtggttggctgcgcctgcgtttgctcaatggctgtaat<br>gcccgttcgctcaatttcgccaccagcgacaatcgcccgctgtatgtgattgccagcgac<br>ggtggtctgctacctgaaccagtgaaggtgagcgaactgccggtgctgatgggcgagcgt<br>tttgaagtgctggtggaggttaacgataacaaaccctttgacctggtgacgctgccggtc<br>agccagatggggatggcgattgcgccgtttgataagcctcatccggtaatgcggattcag<br>ccgattgctattagtgcctccggtgctttgccagacacattaagtagcctgcctgcgtta<br>ccttcgctggaagggctgacggtacgcaagctgcaactctctatggacccgatgctcgat<br>atgatggggatgcagatgctaatggagaaatatggcgatcaggcgatggccgggatggat<br>cacagccagatgatgggccatatggggcacggcaatatgaatcatatgaaccacggcggg<br>aagttcgatttccaccatgccaacaaaatcaacggtcaggcgtttgatatgaacaagccg<br>atgtttgcggcggcgaaagggcaatacgaacgttgggttatctctggcgtgggcgacatg<br>atgctgcatccgttccatatccacggcacgcagttccgtatcttgtcagaaaatggcaaa<br>ccgccagcggctcatcgcgcgggctggaaagataccgttaaggtagaaggtaatgtcagc<br>gaagtgctggtgaagtttaatcacgatgcaccgaaagaacatgcttatatggcgcactgc<br>catctgctggagcatgaagatacggggatgatgttagggtttacggtataa<br>'<br>        };<br><br><br></pre>
Changing <i>fids_to_dna_sequences</i> to <i>fids_to_protein_sequences</i> would produce

<br><pre>$VAR1 = {<br>          'kb|g.0.peg.3' =&gt; '&gt;kb|g.0.peg.3<br>MKTVRESTTLYNFLGSHNPYWRLTESSDVLRFSTTETTEPDRTLQLSAEQAARIREMTVI<br>TSSLMMSLTVDESDLSVHLVGRKINKREWAGNASAWHDTPAVARDLSHGLSFAEQVVSEA<br>HSAIVILDSRGNIQRFNRLCEDYTGLKEHDVIGQSVFKLFMSRREAAASRRNNRVFFRSG<br>NAYEVELWIPTCKGQRLFLFRNKFVHSGSGKNEIFLICSGTDITEERRAQERLRILANTD<br>SITGLPNRNAMQDLIDHAINHADNNKVGVVYLDLDNFKKVNDAYGHLFGDQLLRDVSLAI<br>LSCLEHDQVLARPGGDEFLVLASNTSQSALEAMASRILTRLRLPFRIGLIEVYTSCSVGI<br>ALSPEHGSDSTAIIRHADTAMYTAKEGGRGQFCVFTPEMNQRVFEYLWLDTNLRKALEND<br>QLVIHYQPKITWRGEVRSLEALVRWQSPERGLIPPLDFISYAEESGLIVPLGRWVILDVV<br>RQVAKWRDKGINLRVAVNISARQLADQTIFTALKQVLQELNFEYCPIDVELTESCLIEND<br>ELALSVIQQFSQLGAQVHLDDFGTGYSSLSQLARFPIDAIKLDQVFVRDIHKQPVSQSLV<br>RAIVAVAQALNLQVIAEGVESAKEDAFLTKNGINERQGFLFAKPMPAVAFERWYKRYLKR<br>A<br>',<br>          'kb|g.0.peg.2' =&gt; '&gt;kb|g.0.peg.2<br>MQRRDFLKYSVALGVASALPLWSRAVFAAERPTLPIPDLLTTDARNRIQLTIGAGQSTFG<br>GKTATTWGYNGNLLGPAVKLQRGKAVTVDIYNQLTEETTLHWHGLEVPGEVDGGPQGIIP<br>PGGKRSVTLNVDQPAATCWFHPHQHGKTGRQVAMGLAGLVVIEDDEILKLMLPKQWGIDD<br>VPVIVQDKKFSADGQIDYQLDVMTAAVGWFGDTLLTNGAIYPQHAAPRGWLRLRLLNGCN<br>ARSLNFATSDNRPLYVIASDGGLLPEPVKVSELPVLMGERFEVLVEVNDNKPFDLVTLPV<br>SQMGMAIAPFDKPHPVMRIQPIAISASGALPDTLSSLPALPSLEGLTVRKLQLSMDPMLD<br>MMGMQMLMEKYGDQAMAGMDHSQMMGHMGHGNMNHMNHGGKFDFHHANKINGQAFDMNKP<br>MFAAAKGQYERWVISGVGDMMLHPFHIHGTQFRILSENGKPPAAHRAGWKDTVKVEGNVS<br>EVLVKFNHDAPKEHAYMAHCHLLEHEDTGMMLGFTV<br>'<br>        };<br><br></pre>
<h3><a class="mozTocH2" name="mozTocId231552"></a>Extracting the Protein Families that Contain One or More Features</h3>
If you wish to find out what protein families include a specific feature, you
should use <i>fids_to_protein_families</i>.  Here is a simple example
that looks up the families for each of two fids:
<br><pre>use strict;<br>use Data::Dumper;<br>use Bio::KBase::CDMI::CDMIClient;<br>use Bio::KBase::Utilities::ScriptThing;<br>my $csO = Bio::KBase::CDMI::CDMIClient-&gt;new_for_script();<br><br>my $ffH = $csO-&gt;fids_to_protein_families(['kb|g.1841.peg.788','kb|g.1841.peg.368']);<br>print &amp;Dumper($ffH);<br><br><br></pre>
Running the program should produce something like
<br><pre>$VAR1 = {<br>          'kb|g.1841.peg.368' =&gt; [<br>                                   'FIG00001992'<br>                                 ],<br>          'kb|g.1841.peg.788' =&gt; [<br>                                   'FIG01303839'<br>                                 ]<br>        };<br><br></pre>
<h3><a class="mozTocH2" name="mozTocId439283"></a>Getting Literature Related to a Given Feature</h3>
If you wish to get the literature connected to a set of fids, you can use <i>fids_to_literature</i>:
<br><pre>use strict;<br>use Data::Dumper;<br>use Bio::KBase::CDMI::CDMIClient;<br>use Bio::KBase::Utilities::ScriptThing;<br>my $csO = Bio::KBase::CDMI::CDMIClient-&gt;new_for_script();<br><br>my $litH = $csO-&gt;fids_to_literature(['kb|g.0.peg.2','kb|g.0.peg.3','kb|g.0.peg.4']);<br>print &amp;Dumper($litH);<br><br><br></pre>
This produces
<br><pre>$VAR1 = {<br>          'kb|g.0.peg.2' =&gt; [<br>                              [<br>                                '11527384',<br>                                'http://www.ncbi.nlm.nih.gov/pubmed/11527384',<br>                                'CueO is a multi-copper oxidase that confers copper tolerance in Escherichia coli.'<br>                              ],<br>                              [<br>                                '11867755',<br>                                'http://www.ncbi.nlm.nih.gov/pubmed/11867755',<br>                                'Crystal structure and electron transfer kinetics of CueO, a multicopper oxidase required for copper homeostasis in Escherichia coli.'<br>                              ]<br>                            ]<br>        };<br><br><br></pre>
Note that two of the three features had no connected literature and were left out of the returned hash.
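<br><br>If later code expects an entry for every requested fid, the missing keys can be filled in afterwards.  This is a minimal sketch that uses a stand-in hash abbreviated from the output above in place of a live call to <i>fids_to_literature</i>:
<br><pre>use strict;
use warnings;

my $fids = ['kb|g.0.peg.2','kb|g.0.peg.3','kb|g.0.peg.4'];

# Stand-in for the hash returned by fids_to_literature, abbreviated from the output above.
my $litH = {
    'kb|g.0.peg.2' =&gt; [
        ['11527384',
         'http://www.ncbi.nlm.nih.gov/pubmed/11527384',
         'CueO is a multi-copper oxidase that confers copper tolerance in Escherichia coli.'],
    ],
};

# Give every requested fid an entry, using an empty list when nothing came back.
foreach my $fid (@$fids) {
    $litH-&gt;{$fid} = [] unless exists $litH-&gt;{$fid};
}

# Print each fid with the number of publications attached to it.
foreach my $fid (@$fids) {
    print join("\t", $fid, scalar(@{$litH-&gt;{$fid}})), "\n";
}
</pre>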

<h3><a class="mozTocH2" name="mozTocId787030"></a>Accessing the Annotations Associated with One or More Features</h3>
Features may have attached annotations.  Annotations are thought of as 3-tuples:
<ol>
<li>a text string (often describing a change in assigned function),
</li><li>who made the annotation, and
</li><li>a timestamp giving the precise time at which the annotation was made.
</li></ol>
In the short example program below, the timestamps for one of the features are expanded into a readable form:
<br><pre>use strict;<br>use Data::Dumper;<br>use Bio::KBase::CDMI::CDMIClient;<br>use Bio::KBase::Utilities::ScriptThing;<br>my $csO = Bio::KBase::CDMI::CDMIClient-&gt;new_for_script();<br><br>my $annH = $csO-&gt;fids_to_annotations(['kb|g.0.peg.2','kb|g.0.peg.3','kb|g.0.peg.4']);<br>my @tab  = map { [scalar localtime($_-&gt;[2]),$_-&gt;[1],$_-&gt;[0]] } <br>           sort { ($b-&gt;[2] &lt;=&gt; $a-&gt;[2]) or ($a-&gt;[0] cmp $b-&gt;[0]) } <br>           @{$annH-&gt;{'kb|g.0.peg.2'}};<br>print &amp;Dumper($annH,\@tab);<br><br><br></pre>
This produces as output
<br><pre>$VAR1 = {<br>          'kb|g.0.peg.3' =&gt; [<br>                              [<br>                                'Set function to<br>hypothetical protein',<br>                                'EC',<br>                                '1116992974'<br>                              ],<br>                              [<br>                                'Set function to<br>Sensory box/GGDEF family protein',<br>                                'master',<br>                                '1088696623'<br>                              ]<br>                            ],<br>          'kb|g.0.peg.2' =&gt; [<br>                              [<br>                                'Set function to<br>Blue copper oxidase CueO precursor',<br>                                'OlgaV',<br>                                1233354582<br>                              ],<br>                              [<br>                                'Role changed from \'Blue copper oxidase cueO precursor\' to \'Blue copper oxidase CueO precursor\'',<br>                                'OlgaV',<br>                                1233354582<br>                              ],<br>                              [<br>                                'Function set by OlgaV at 1233354582<br>Blue copper oxidase CueO precursor',<br>                                'OlgaV',<br>                                1233354582<br>                              ]<br>                            ],<br>          'kb|g.0.peg.4' =&gt; [<br>                              [<br>                                'Set function to<br>L,D-transpeptidase YcfS',<br>                                'claudia',<br>                                '1215629062'<br>                              ],<br>                              [<br>                                'Set function to<br>L,D-transpeptidase YcfS',<br>                                'claudia',<br>                                '1215628461'<br>                              
],<br>                              [<br>                                'L,D-transpeptidase YcfS',<br>                                'claudia',<br>                                '1215628461'<br>                              ],<br>                              [<br>                                'Set function to<br>Protein erfK/srfK precursor',<br>                                'claudia',<br>                                '1215627140'<br>                              ],<br>                              [<br>                                'Set function to<br>hypothetical protein',<br>                                'EC',<br>                                '1116992961'<br>                              ],<br>                              [<br>                                'Set function to<br>Protein erfK/srfK precursor',<br>                                'master',<br>                                '1088690624'<br>                              ]<br>                            ]<br>        };<br>$VAR2 = [<br>          [<br>            'Fri Jan 30 16:29:42 2009',<br>            'OlgaV',<br>            'Function set by OlgaV at 1233354582<br>Blue copper oxidase CueO precursor'<br>          ],<br>          [<br>            'Fri Jan 30 16:29:42 2009',<br>            'OlgaV',<br>            'Role changed from \'Blue copper oxidase cueO precursor\' to \'Blue copper oxidase CueO precursor\''<br>          ],<br>          [<br>            'Fri Jan 30 16:29:42 2009',<br>            'OlgaV',<br>            'Set function to<br>Blue copper oxidase CueO precursor'<br>          ]<br>        ];<br><br></pre>
Note that the text of an annotation often includes newlines.
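<br><br>When you want one line per annotation (for a tab-separated report, say), the embedded newlines can be collapsed first.  The following is a minimal sketch operating on one tuple copied from the output above; the helper <i>flatten_text</i> is hypothetical, and <i>gmtime</i> is used instead of <i>localtime</i> only so that the result does not depend on the local time zone.
<br><pre>use strict;
use warnings;
use POSIX qw(strftime);

# One annotation tuple copied from the output above: [text, annotator, timestamp].
my @annotations = (
    ["Set function to\nBlue copper oxidase CueO precursor", 'OlgaV', 1233354582],
);

# Collapse each embedded newline (and any surrounding whitespace) into a single space.
sub flatten_text {
    my($text) = @_;
    $text =~ s/\s*\n\s*/ /g;
    return $text;
}

foreach my $ann (@annotations) {
    my($text, $who, $when) = @$ann;
    # Format the timestamp in UTC so the output is the same on every machine.
    my $stamp = strftime('%Y-%m-%d %H:%M:%S', gmtime($when));
    print join("\t", $stamp, $who, flatten_text($text)), "\n";
}
</pre>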

<h3><a class="mozTocH2" name="mozTocId948659"></a>Locating Features that Tend to Co-occur with a Given Feature</h3>

Preserved contiguity on the chromosome is one of the more important clues relating to
function.  To find the fids that appear to be part of a conserved neighborhood, use
something like

<br><pre>use strict;<br>use Data::Dumper;<br>use Bio::KBase::CDMI::CDMIClient;<br>use Bio::KBase::Utilities::ScriptThing;<br>my $csO = Bio::KBase::CDMI::CDMIClient-&gt;new_for_script();<br><br>my $dataH =  $csO-&gt;fids_to_co_occurring_fids(['kb|g.1841.peg.788']);<br>print &amp;Dumper($dataH);<br><br><br></pre>
This would produce
<br><pre>$VAR1 = {<br>          'kb|g.1841.peg.788' =&gt; [<br>                                   [<br>                                     'kb|g.1841.peg.659',<br>                                     '13'<br>                                   ],<br>                                   [<br>                                     'kb|g.1841.peg.896',<br>                                     '43'<br>                                   ],<br>                                   [<br>                                     'kb|g.1841.peg.837',<br>                                     '14'<br>                                   ],<br>                                   [<br>                                     'kb|g.1841.peg.368',<br>                                     '23'<br>                                   ]<br>                                 ]<br>        };<br><br></pre>
The list returned for <i>kb|g.1841.peg.788</i> is a list of 2-tuples: [fid,score].
The score is computed as the number of distinct OTUs in which genes that look
quite similar occur close to one another on the chromosome. 
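<br><br>The pairs are not returned in score order.  A minimal sketch of ranking them, using the pairs copied from the output above rather than a live call:
<br><pre>use strict;
use warnings;

# [fid, score] pairs copied from the output above.
my @pairs = (
    ['kb|g.1841.peg.659', '13'],
    ['kb|g.1841.peg.896', '43'],
    ['kb|g.1841.peg.837', '14'],
    ['kb|g.1841.peg.368', '23'],
);

# Sort by score, highest first; equal scores fall back to the fid for a repeatable order.
my @ranked = sort { ($b-&gt;[1] &lt;=&gt; $a-&gt;[1]) or ($a-&gt;[0] cmp $b-&gt;[0]) } @pairs;

print join("\t", @$_), "\n" foreach @ranked;
</pre>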

<h3><a class="mozTocH2" name="mozTocId611557"></a>Locating Subsystems that Contain a Given Feature</h3>
Suppose that you have a Feature (i.e., a fid), and you would like to know what subsystems
contain it.  You might use something like

<br><pre>use strict;<br>use Data::Dumper;<br>use Bio::KBase::CDMI::CDMIClient;<br>use Bio::KBase::Utilities::ScriptThing;<br>my $csO = Bio::KBase::CDMI::CDMIClient-&gt;new_for_script();<br><br>my $fidH = $csO-&gt;fids_to_subsystem_data(['kb|g.1841.peg.2636','kb|g.1841.peg.3010']);<br>print &amp;Dumper($fidH);<br><br><br></pre>
which would produce something like
<br><pre>$VAR1 = {<br>          'kb|g.1841.peg.2636' =&gt; [<br>                                    [<br>                                      'Arginine Deiminase Pathway',<br>                                      '1.x',<br>                                      'Carbamate kinase (EC 2.7.2.2)'<br>                                    ],<br>                                    [<br>                                      'Arginine and Ornithine Degradation',<br>                                      '1.1234',<br>                                      'Carbamate kinase (EC 2.7.2.2)'<br>                                    ],<br>                                    [<br>                                      'Polyamine Metabolism',<br>                                      '1',<br>                                      'Carbamate kinase (EC 2.7.2.2)'<br>                                    ]<br>                                  ],<br>          'kb|g.1841.peg.3010' =&gt; [<br>                                    [<br>                                      'Arginine Biosynthesis extended',<br>                                      '1.0',<br>                                      'Ornithine carbamoyltransferase (EC 2.1.3.3)'<br>                                    ],<br>                                    [<br>                                      'Arginine Deiminase Pathway',<br>                                      '1.x',<br>                                      'Ornithine carbamoyltransferase (EC 2.1.3.3)'<br>                                    ],<br>                                    [<br>                                      'Arginine and Ornithine Degradation',<br>                                      '1.1234',<br>                                      'Ornithine carbamoyltransferase (EC 2.1.3.3)'<br>                                    ]<br>                                  ]<br>        };<br><br></pre>
That is, for each fid in the list you give as input, you get back a reference to a list of
3-tuples.  Each 3-tuple contains
<ol>
<li>a subsystem name,
</li><li>a variant code associated with the row in the subsystem containing the fid, and
</li><li>the role associated with the column containing the fid.
</li></ol>
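<br>It is sometimes more convenient to look at the same data by subsystem.  The following is a minimal sketch that inverts the structure, using a stand-in hash abbreviated from the output above in place of a live call to <i>fids_to_subsystem_data</i>; the variable names are illustrative only.
<br><pre>use strict;
use warnings;

# Stand-in for the hash returned by fids_to_subsystem_data, abbreviated from the output above.
my $fidH = {
    'kb|g.1841.peg.2636' =&gt; [
        ['Arginine Deiminase Pathway', '1.x', 'Carbamate kinase (EC 2.7.2.2)'],
        ['Polyamine Metabolism', '1', 'Carbamate kinase (EC 2.7.2.2)'],
    ],
    'kb|g.1841.peg.3010' =&gt; [
        ['Arginine Deiminase Pathway', '1.x', 'Ornithine carbamoyltransferase (EC 2.1.3.3)'],
    ],
};

# Invert the structure: subsystem name =&gt; list of [fid, role].
my %by_subsystem;
foreach my $fid (sort keys %$fidH) {
    foreach my $tuple (@{$fidH-&gt;{$fid}}) {
        my($subsystem, $variant, $role) = @$tuple;
        push @{$by_subsystem{$subsystem}}, [$fid, $role];
    }
}

# Print each subsystem followed by its fids and roles, one per indented line.
foreach my $subsystem (sort keys %by_subsystem) {
    print "$subsystem\n";
    print "\t", join("\t", @$_), "\n" foreach @{$by_subsystem{$subsystem}};
}
</pre>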

<h3><a class="mozTocH2" name="mozTocId182481"></a>What Atomic Regulons Contain a fid (if any)?</h3>

For a very few genomes, we have substantial expression data.  For these we
generated expression profiles and tried to form sets of genes with identical
expression profiles.  We called these <i>atomic regulons</i>.  You can see whether or
not features are in atomic regulons using

<br><pre>use strict;<br>use Data::Dumper;<br>use Bio::KBase::CDMI::CDMIClient;<br>use Bio::KBase::Utilities::ScriptThing;<br>my $csO = Bio::KBase::CDMI::CDMIClient-&gt;new_for_script();<br><br>my $fidH = $csO-&gt;fids_to_atomic_regulons(['kb|g.0.peg.2','kb|g.0.peg.20']);<br>print &amp;Dumper($fidH);<br><br><br></pre>
which produces something like
<br><pre>$VAR1 = {<br>          'kb|g.0.peg.20' =&gt; [<br>                               [<br>                                 'kb|g.0.ar.86',<br>                                 5<br>                               ]<br>                             ]<br>        };<br><br><br></pre>

Suppose that you wished to find out which other fids are in atomic regulon <i>kb|g.0.ar.86</i>.
You could use
<br><pre>use strict;<br>use Data::Dumper;<br>use Bio::KBase::CDMI::CDMIClient;<br>use Bio::KBase::Utilities::ScriptThing;<br>my $csO = Bio::KBase::CDMI::CDMIClient-&gt;new_for_script();<br><br>my $ar      = 'kb|g.0.ar.86';<br>my $arH     = $csO-&gt;atomic_regulons_to_fids([$ar]);<br>my $fids    = $arH-&gt;{$ar};<br>my $funcH   = $csO-&gt;fids_to_functions($fids);<br>print &amp;Dumper($fids,$funcH);<br><br></pre>
which produces
<br><pre>$VAR1 = [<br>          'kb|g.0.peg.146',<br>          'kb|g.0.peg.20',<br>          'kb|g.0.peg.205',<br>          'kb|g.0.peg.240',<br>          'kb|g.0.peg.66'<br>        ];<br>$VAR2 = {<br>          'kb|g.0.peg.66' =&gt; 'Oligopeptide transport system permease protein OppC (TC 3.A.1.5.1)',<br>          'kb|g.0.peg.205' =&gt; 'Oligopeptide ABC transporter, periplasmic oligopeptide-binding protein OppA (TC 3.A.1.5.1)',<br>          'kb|g.0.peg.240' =&gt; 'Oligopeptide transport system permease protein OppB (TC 3.A.1.5.1)',<br>          'kb|g.0.peg.146' =&gt; 'Oligopeptide transport ATP-binding protein OppD (TC 3.A.1.5.1)',<br>          'kb|g.0.peg.20' =&gt; 'Oligopeptide transport ATP-binding protein OppF (TC 3.A.1.5.1)'<br>        };<br><br><br></pre>
<h3><a class="mozTocH2" name="mozTocId300685"></a>Getting the Pearson Correlation Coefficient for Apparently Coexpressed Features</h3>

Suppose that you had fids from one of the somewhat rare genomes for which we have substantial
expression data.  To get the PCC values for apparently co-expressed fids, use
<br><pre>use strict;<br>use Data::Dumper;<br>use Bio::KBase::CDMI::CDMIClient;<br>use Bio::KBase::Utilities::ScriptThing;<br>my $csO = Bio::KBase::CDMI::CDMIClient-&gt;new_for_script();<br><br>my $fids   = ['kb|g.0.peg.2257','kb|g.0.peg.1094'];<br>my $coexpH = $csO-&gt;fids_to_coexpressed_fids($fids);<br>print &amp;Dumper($coexpH);<br><br></pre>
For each input fid, you will get a list of [fid,score] pairs covering the features whose score has an absolute value greater than or equal to 0.5:

<br><pre>$VAR1 = {<br>          'kb|g.0.peg.2257' =&gt; [<br>                                 [<br>                                   'kb|g.0.peg.1094',<br>                                   '0.501'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.865',<br>                                   '0.515'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.2113',<br>                                   '0.524'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.1497',<br>                                   '0.531'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.1388',<br>                                   '0.546'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.1691',<br>                                   '0.883'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.1401',<br>                                   '0.897'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.2004',<br>                                   '0.767'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.2108',<br>                                   '0.769'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.2681',<br>                                   '0.512'<br>                                 ]<br>                               ],<br>          'kb|g.0.peg.1094' =&gt; [<br>                                 [<br>                           
        'kb|g.0.peg.185',<br>                                   '0.582'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.939',<br>                                   '0.516'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.547',<br>                                   '0.504'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.518',<br>                                   '0.618'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.627',<br>                                   '0.595'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.1338',<br>                                   '0.502'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.510',<br>                                   '0.522'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.966',<br>                                   '0.607'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.974',<br>                                   '0.561'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.793',<br>                                   '0.529'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.813',<br>                                   '0.536'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.600',<br> 
                                  '0.557'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.488',<br>                                   '0.573'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.671',<br>                                   '0.678'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.786',<br>                                   '0.612'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.2307',<br>                                   '0.513'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.1781',<br>                                   '0.505'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.2282',<br>                                   '0.597'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.1578',<br>                                   '0.549'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.1422',<br>                                   '0.547'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.2071',<br>                                   '0.561'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.2309',<br>                                   '0.514'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.1583',<br>                        
           '0.522'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.2331',<br>                                   '0.531'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.1371',<br>                                   '0.530'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.1656',<br>                                   '0.545'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.2257',<br>                                   '0.501'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.2359',<br>                                   '0.513'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.3081',<br>                                   '0.541'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.2681',<br>                                   '0.603'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.2473',<br>                                   '0.582'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.3335',<br>                                   '0.588'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.2615',<br>                                   '-0.525'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.2474',<br>                                   
'0.522'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.2388',<br>                                   '0.603'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.3278',<br>                                   '0.675'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.3283',<br>                                   '0.596'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.3795',<br>                                   '0.505'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.3573',<br>                                   '0.572'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.3809',<br>                                   '0.525'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.3576',<br>                                   '0.543'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.3387',<br>                                   '0.516'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.3937',<br>                                   '0.533'<br>                                 ],<br>                                 [<br>                                   'kb|g.0.peg.4032',<br>                                   '0.636'<br>                                 ]<br>                               ]<br>        };<br><br></pre>
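A dump this long is easier to read once it has been filtered. The following is a minimal sketch, not part of the CS-API: it assumes a hash of the same shape as the output above (fid to a list of [other fid, coefficient] pairs), uses a few pairs copied from that output so it runs without a connection to the CS, and introduces a hypothetical helper named <i>strongly_correlated</i>.

```perl
use strict;
use warnings;

# Same shape as the dump above: fid => list of [ other fid, coefficient ] pairs.
# The pairs are copied from the output; note the coefficients arrive as strings.
my $corrH = {
    'kb|g.0.peg.2257' => [
        [ 'kb|g.0.peg.1094', '0.501' ],
        [ 'kb|g.0.peg.1691', '0.883' ],
        [ 'kb|g.0.peg.1401', '0.897' ],
        [ 'kb|g.0.peg.2004', '0.767' ],
    ],
};

# Hypothetical helper: return the pairs for $fid whose absolute coefficient
# is at least $min, strongest first. Using abs() keeps strong negative
# correlations as well as strong positive ones.
sub strongly_correlated {
    my ($hash, $fid, $min) = @_;
    my $pairs = $hash->{$fid} || [];
    return [ sort { abs($b->[1]) <=> abs($a->[1]) }
             grep { abs($_->[1]) >= $min } @$pairs ];
}

foreach my $pair (@{ strongly_correlated($corrH, 'kb|g.0.peg.2257', 0.75) }) {
    print join("\t", @$pair), "\n";
}
```

The threshold of 0.75 is arbitrary; choose whatever cutoff suits your data.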

<h3><a class="mozTocH2" name="mozTocId57110"></a>Finding All Features with the Same Protein Sequence</h3>
Suppose that you have a fid, and you want to compute the set of equivalent fids
(in the sense that they have exactly the same protein sequence).  You might use something
like
<br><pre>use strict;<br>use Data::Dumper;<br>use Bio::KBase::CDMI::CDMIClient;<br>use Bio::KBase::Utilities::ScriptThing;<br>my $csO = Bio::KBase::CDMI::CDMIClient-&gt;new_for_script();<br><br>my $fid  = 'kb|g.0.peg.2';<br><br>my $fidH = $csO-&gt;fids_to_proteins([$fid]);<br>my $md5  = $fidH-&gt;{$fid};<br>my $md5H = $csO-&gt;proteins_to_fids([$md5]);<br>my $fids = $md5H-&gt;{$md5};<br>print &amp;Dumper($fids);<br><br><br></pre>
That is, first you look up the md5 value for the protein sequence of the given fid,
and then you ask for the set of fids with that md5 value for their protein sequence.
Running this little program produces something like
<br><pre>$VAR1 = [<br>          'kb|g.0.peg.2',<br>          'kb|g.10012.peg.2654',<br>          'kb|g.1609.peg.1954',<br>          'kb|g.1610.peg.2264',<br>          'kb|g.1748.peg.3345',<br>          'kb|g.1870.peg.354',<br>          'kb|g.2136.peg.2112',<br>          'kb|g.2226.peg.2607',<br>          'kb|g.2295.peg.2245',<br>          'kb|g.2590.peg.869',<br>          'kb|g.2806.peg.1989',<br>          'kb|g.2810.peg.2351',<br>          'kb|g.2891.peg.4128',<br>          'kb|g.2893.peg.2189',<br>          'kb|g.2942.peg.2402',<br>          'kb|g.3205.peg.2854',<br>          'kb|g.3206.peg.2686',<br>          'kb|g.3207.peg.82',<br>          'kb|g.3211.peg.3447',<br>          'kb|g.3505.peg.2125',<br>          'kb|g.3508.peg.3768',<br>          'kb|g.3511.peg.4795',<br>          'kb|g.3558.peg.499',<br>          'kb|g.842.peg.645',<br>          'kb|g.844.peg.423',<br>          'kb|g.9268.peg.514',<br>          'kb|g.9269.peg.538',<br>          'kb|g.9450.peg.3258',<br>          'kb|g.954.peg.149',<br>          'kb|g.955.peg.594',<br>          'kb|g.9604.peg.3253',<br>          'kb|g.976.peg.3756',<br>          'kb|g.977.peg.814'<br>        ];<br><br></pre>
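Once you have the list of equivalent fids, you may want to know which genomes they come from. The sketch below groups fids by genome. It rests on an assumption drawn only from the IDs shown above, namely that the genome ID is the leading "kb|g.N" part of each fid; the helper names <i>genome_of_fid</i> and <i>group_by_genome</i> are made up for this example, and the fids are hard-coded so that the code runs without a connection to the CS.

```perl
use strict;
use warnings;

# A few of the fids from the output above.
my @fids = ('kb|g.0.peg.2', 'kb|g.10012.peg.2654', 'kb|g.1609.peg.1954');

# Assumption: the genome ID is the leading "kb|g.N" part of the fid.
# Returns undef for a string that does not look like that.
sub genome_of_fid {
    my ($fid) = @_;
    return ($fid =~ /^(kb\|g\.\d+)\./) ? $1 : undef;
}

# Build a hash from genome ID to the list of fids in that genome.
sub group_by_genome {
    my ($fids) = @_;
    my %by_genome;
    foreach my $fid (@$fids) {
        my $genome = genome_of_fid($fid);
        next unless defined $genome;    # skip anything we cannot parse
        push @{ $by_genome{$genome} }, $fid;
    }
    return \%by_genome;
}

my $by_genome = group_by_genome(\@fids);
foreach my $genome (sort keys %$by_genome) {
    print join("\t", $genome, @{ $by_genome->{$genome} }), "\n";
}
```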
<h3><a class="mozTocH2" name="mozTocId586120"></a>Accessing Members of a Protein Family and the Family Functions</h3>
Suppose that you have a set of protein families and you wish to look up
their functions and members.  This can be done using code like
<br><pre>use strict;<br>use Data::Dumper;<br>use Bio::KBase::CDMI::CDMIClient;<br>use Bio::KBase::Utilities::ScriptThing;<br>my $csO = Bio::KBase::CDMI::CDMIClient-&gt;new_for_script();<br><br>my $protein_families = ["FIG01303839"];<br>my $famsH            = $csO-&gt;protein_families_to_functions($protein_families);<br>my $pegsH            = $csO-&gt;protein_families_to_fids($protein_families);<br>print &amp;Dumper($famsH,$pegsH);<br><br></pre>
which produces output like
<br><pre>$VAR1 = {<br>          'FIG01303839' =&gt; 'Enoyl-CoA hydratase (EC 4.2.1.17) / 3-hydroxyacyl-CoA dehydrogenase (EC 1.1.1.35) / 3-hydroxybutyryl-CoA epimerase (EC 5.1.2.3)'<br>        };<br>$VAR2 = {<br>          'FIG01303839' =&gt; [<br>                             'kb|g.22.peg.5488',<br>                             'kb|g.26.peg.1073',<br>                             'kb|g.45.peg.6310',<br>                             'kb|g.55.peg.4641',<br>                             'kb|g.56.peg.177',<br>                             'kb|g.59.peg.1996',<br>                             'kb|g.64.peg.2224',<br>                             'kb|g.67.peg.1310',<br>                             'kb|g.75.peg.2967',<br>                             'kb|g.76.peg.670',<br>                             'kb|g.77.peg.1250',<br>			     .<br>			     .<br>			     .<br>                             'kb|g.3883.peg.1974',<br>                             'kb|g.3884.peg.4283'<br>                           ]<br>        };<br><br></pre>
<h3><a class="mozTocH2" name="mozTocId458855"></a>Getting a Set of Co-occurring Protein Families</h3>
Protein families are thought of as co-occurring if members of the families tend to
occur close to one another on the chromosome.  That is, it is basically conserved
contiguity.
<p>
To find which protein families tend to co-occur, you can use code like

<br></p><pre>use strict;<br>use Data::Dumper;<br>use Bio::KBase::CDMI::CDMIClient;<br>use Bio::KBase::Utilities::ScriptThing;<br>my $csO = Bio::KBase::CDMI::CDMIClient-&gt;new_for_script();<br><br>my $fams    = ['FIG01303839','FIG00001992'];<br>my $funcsH  = $csO-&gt;protein_families_to_functions($fams);<br>my $famsH   = $csO-&gt;protein_families_to_co_occurring_families($fams);<br>print &amp;Dumper($funcsH,$famsH);<br><br></pre>
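This simple program displays the functions of the input families (two of them in this simple case), and
then displays the co-occurring families for each of the input families.  Each co-occurring family is
reported as a triple: the family ID, a count of the OTUs in which the two families co-occur, and the
function of the family.  It produces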
<br><pre>$VAR1 = {<br>          'FIG01303839' =&gt; 'Enoyl-CoA hydratase (EC 4.2.1.17) / 3-hydroxyacyl-CoA dehydrogenase (EC 1.1.1.35) / 3-hydroxybutyryl-CoA epimerase (EC 5.1.2.3)',<br>          'FIG00001992' =&gt; 'Phosphohistidine phosphatase SixA'<br>        };<br>$VAR2 = {<br>          'FIG01303839' =&gt; [<br>                             [<br>                               'FIG01304970',<br>                               29,<br>                               'Modification methylase, HemK family (EC 2.1.1.72)'<br>                             ]<br>                           ],<br>          'FIG00001992' =&gt; [<br>                             [<br>                               'FIG00002110',<br>                               15,<br>                               '2-keto-3-deoxy-D-arabino-heptulosonate-7-phosphate synthase I alpha (EC 2.5.1.54)'<br>                             ],<br>                             [<br>                               'FIG00002671',<br>                               84,<br>                               'Protease III precursor (EC 3.4.24.55)'<br>                             ],<br>                             [<br>                               'FIG00003087',<br>                               103,<br>                               'Murein endopeptidase'<br>                             ],<br>                             [<br>                               'FIG00007476',<br>                               184,<br>                               'Periplasmic fimbrial chaperone StfD'<br>                             ],<br>                             [<br>                               'FIG00009029',<br>                               186,<br>                               'Major fimbrial subunit StfA'<br>                             ],<br>                             [<br>                               'FIG00009068',<br>                               43,<br>                               'RNA polymerase sigma factor for flagellar 
operon'<br>                             ],<br>                             [<br>                               'FIG00011116',<br>                               140,<br>                               'Fimbriae usher protein StfC'<br>                             ],<br>                             [<br>                               'FIG00013366',<br>                               13,<br>                               'ADP-heptose--lipooligosaccharide heptosyltransferase II (EC 2.4.1.-)'<br>                             ],<br>                             [<br>                               'FIG00029709',<br>                               45,<br>                               'Flagellar biosynthesis protein FlhA'<br>                             ],<br>                             [<br>                               'FIG00030797',<br>                               68,<br>                               'C4-type zinc finger protein, DksA/TraR family'<br>                             ],<br>                             [<br>                               'FIG00039085',<br>                               69,<br>                               'Mlr0777 protein'<br>                             ],<br>                             [<br>                               'FIG00138924',<br>                               63,<br>                               'Amino acid regulated cytosolic protein'<br>                             ],<br>                             [<br>                               'FIG00146544',<br>                               128,<br>                               'Ribosomal protein L3 methyltransferase'<br>                             ],<br>                             [<br>                               'FIG00229397',<br>                               71,<br>                               'GTP-binding protein'<br>                             ],<br>                             [<br>                               'FIG01220567',<br>                               42,<br>  
                             'FIG01220568: hypothetical protein'<br>                             ],<br>                             [<br>                               'FIG01272345',<br>                               134,<br>                               'Putative membrane protein YfcA'<br>                             ],<br>                             [<br>                               'FIG01303839',<br>                               328,<br>                               'Enoyl-CoA hydratase (EC 4.2.1.17) / 3-hydroxyacyl-CoA dehydrogenase (EC 1.1.1.35) / 3-hydroxybutyryl-CoA epimerase (EC 5.1.2.3)'<br>                             ],<br>                             [<br>                               'FIG01304970',<br>                               52,<br>                               'Modification methylase, HemK family (EC 2.1.1.72)'<br>                             ]<br>                           ]<br>        };<br><br></pre>
Did you know that a fid implementing <i>Enoyl-CoA hydratase (EC 4.2.1.17) / 3-hydroxyacyl-CoA dehydrogenase (EC 1.1.1.35) / 3-hydroxybutyryl-CoA
epimerase (EC 5.1.2.3)</i> co-occurs with a peg assigned the function
<i>Phosphohistidine phosphatase SixA</i> in 328 distinct OTUs?
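When a family has many co-occurring families, it helps to rank them by count. Here is a minimal sketch, assuming a list of [family ID, count, function] triples shaped like one value of $famsH above. The triples are copied from the output so the code runs without a connection to the CS, and <i>strongest_co_occurrences</i> is a hypothetical helper, not a CS-API call.

```perl
use strict;
use warnings;

# Same shape as one value of $famsH above: [ family ID, count, function ].
my $co_occurring = [
    [ 'FIG00002671', 84,  'Protease III precursor (EC 3.4.24.55)' ],
    [ 'FIG00003087', 103, 'Murein endopeptidase' ],
    [ 'FIG00007476', 184, 'Periplasmic fimbrial chaperone StfD' ],
    [ 'FIG00229397', 71,  'GTP-binding protein' ],
];

# Hypothetical helper: keep the triples whose count is at least $min_count
# and return them ordered from the highest count to the lowest.
sub strongest_co_occurrences {
    my ($triples, $min_count) = @_;
    return [ sort { $b->[1] <=> $a->[1] }
             grep { $_->[1] >= $min_count } @$triples ];
}

foreach my $triple (@{ strongest_co_occurrences($co_occurring, 80) }) {
    my ($family, $count, $function) = @$triple;
    print join("\t", $family, $count, $function), "\n";
}
```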

<h3><a class="mozTocH2" name="mozTocId193709"></a>Given a Role, which Subsystems Include it?</h3>

The notion of a <i>Role</i> is central to that of a controlled vocabulary.  <i>Subsystems</i> are sets of such roles
and provide the framework within which roles are connected to specific genes.  So, given a role like <i>SSU ribosomal protein S12p (S23e)</i>, which subsystem or subsystems are used to propagate the annotations?  You can find out using code like

<br><pre>use strict;<br>use Data::Dumper;<br>use Bio::KBase::CDMI::CDMIClient;<br>use Bio::KBase::Utilities::ScriptThing;<br>my $csO = Bio::KBase::CDMI::CDMIClient-&gt;new_for_script();<br><br>my $role = "SSU ribosomal protein S12p (S23e)";<br><br>my $roleH = $csO-&gt;roles_to_subsystems([$role]);<br>print &amp;Dumper($roleH);<br><br></pre>
This produces the following output:
<br><pre>$VAR1 = {<br>          'SSU ribosomal protein S12p (S23e)' =&gt; [<br>                                                   '271-Bsub',<br>                                                   'EGS prediction',<br>                                                   'Mycobacterium virulence operon involved in protein synthesis (SSU ribosomal proteins)',<br>                                                   'Ribosomal protein S12p Asp methylthiotransferase',<br>                                                   'Ribosome SSU bacterial',<br>                                                   'Virulence operon involved in protein synthesis (SSU ribosomal proteins)'<br>                                                 ]<br>        };<br><br></pre>
That is, the role is in six distinct subsystems.
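If you pass several roles at once, you may prefer the answer the other way around: for each subsystem, which of your roles does it include? The following sketch inverts a hash shaped like $roleH above. The helper <i>invert_role_map</i> is hypothetical, and the data (a subset of the output above) is hard-coded so the code runs without a connection to the CS.

```perl
use strict;
use warnings;

# Same shape as $roleH above: role => list of subsystem names.
my $roleH = {
    'SSU ribosomal protein S12p (S23e)' => [
        'Ribosomal protein S12p Asp methylthiotransferase',
        'Ribosome SSU bacterial',
    ],
};

# Hypothetical helper: turn role => [subsystems] into subsystem => [roles].
# Roles are visited in sorted order so each list comes out sorted.
sub invert_role_map {
    my ($map) = @_;
    my %by_subsystem;
    foreach my $role (sort keys %$map) {
        foreach my $subsystem (@{ $map->{$role} }) {
            push @{ $by_subsystem{$subsystem} }, $role;
        }
    }
    return \%by_subsystem;
}

my $subsysH = invert_role_map($roleH);
foreach my $subsystem (sort keys %$subsysH) {
    print join("\t", $subsystem, @{ $subsysH->{$subsystem} }), "\n";
}
```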
<h3><a class="mozTocH2" name="mozTocId869371"></a>Going from Roles to Complexes to Reactions to Printable Versions of Reactions</h3>

<i>Roles</i> are parts of <i>complexes</i>.  A complex is made up of one or more roles.
Some complexes implement reactions.  The IDs for complexes and reactions are unreadable strings,
but they can be expanded into meaningful strings.  This little example shows how the
concepts fit together:
<br><pre>use strict;<br>use Data::Dumper;<br>use Bio::KBase::CDMI::CDMIClient;<br>use Bio::KBase::Utilities::ScriptThing;<br>my $csO = Bio::KBase::CDMI::CDMIClient-&gt;new_for_script();<br><br>my $role   = 'Pyruvate kinase (EC 2.7.1.40)';<br>my $roleH  = $csO-&gt;roles_to_complexes([$role]);<br>my @complexes = map { $_-&gt;[0] } @{$roleH-&gt;{$role}};<br>my $complexH = $csO-&gt;complexes_to_complex_data(\@complexes);<br>foreach my $complex (keys(%$complexH))<br>{<br>    my $complex_data = $complexH-&gt;{$complex};<br>    my $reactions    = $complex_data-&gt;{complex_reactions};<br>    if ($reactions &amp;&amp; (@$reactions &gt; 0))<br>    {<br>	my $reactionH = $csO-&gt;reaction_strings($reactions);<br>	foreach my $reaction (keys(%$reactionH))<br>	{<br>	    my $readable = $reactionH-&gt;{$reaction};<br>	    print join("\t",($complex,$reaction,$readable)),"\n";<br>	}<br>    }<br>}<br><br><br></pre>
The output of the program is a single line:
<br><pre>2AD4B58C-66F7-11E1-B48F-62E56F82D269	4DC986F0-66F5-11E1-B48F-62E56F82D269	ATP + Pyruvate &lt;=&gt; ADP + Phosphoenolpyruvate<br><br></pre>
So, the given role connects to one complex, which in turn implements a single reaction (which is one of
the steps in glycolysis).
