
Revision 1.5
Tue Feb 12 20:25:05 2008 UTC (11 years, 9 months ago) by overbeek
Branch: MAIN
CVS Tags: rast_rel_2008_06_18, rast_rel_2008_06_16, rast_rel_2008_07_21, rast_2008_0924, rast_rel_2008_09_30, rast_rel_2009_02_05, rast_rel_2008_12_18, rast_rel_2008_10_09, rast_release_2008_09_29, rast_rel_2008_04_23, rast_rel_2008_08_07, rast_rel_2008_09_29, rast_rel_2008_10_29, rast_rel_2009_03_26, rast_rel_2008_11_24
Changes since 1.4: +931 -576 lines
additions to my notes on the abstract tutorial

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head><title>Abstraction Working Document</title>

</head>
<body>
<div align="center">
<h1>The Role of Bioinformatics in Interpreting Genomes of
Unicellular Organisms:</h1>
<h1>An Abstract View</h1>
<h2>by Ross Overbeek, ...</h2>
</div>
<h2>Introduction</h2>
This strange document began as a tutorial for computer scientists and
mathematicians. It was supposed
to somehow introduce them to the computational issues in genome
analysis.
It was requested by an instructor in a computer class. In attempting
to respond to this request, Overbeek
formulated an abstraction that he began to believe had significance
beyond the tutorial.
<p>This document is a set of working notes relating to the
abstract. It is not organized properly as
an abstraction, a tutorial, or an essay on the role of bioinformatics
in support of biological research. It is,
however, organized properly as a working document that relates to all
of these goals.
</p>
<p>It begins with a development of the abstraction. This will be
suitable for mathematicians or computer scientists.
The abstraction is developed in three steps: the basic abstraction, the
enhanced abstraction needed to provide
basic bioinformatics support for biologists, and finally a third step
that includes support for the notion
of regulation. The intent throughout this discussion will be to seek a
minimal set of concepts needed to
effectively capture the essence of the required data. Unlike almost all
efforts to lay a foundation
for tutorials, software or research in biology, this effort focuses on
leaving out as much as possible.
While we do believe that there is an almost unlimited complexity that
can be introduced, and almost all of
it is needed for some specific goals, the vast majority of tools and
discussions require (we believe) relatively few
concepts. As they say, "the proof is in the pudding."
</p>
<p>The second section will feature somewhat more tutorial commentary.
It may well repeat much of what is in Part 1.
This part is offered as a way of easing a computer scientist or
mathematician into the issues that need to be
considered, if they wish to try to do useful research relating to the
genomics revolution. Eventually, this part
will be dramatically expanded by giving condensed summaries of the
machines of the cell broken into two broad
sets: the metabolic network and the cellular machinery not directly
included in the metabolic network. Loosely,
this separates what would be learned in a microbial biochemistry class
(when they exist) from what would
be learned in a course on molecular biology.
</p>
<p>The third part is an essay that attempts to characterize our
view on </p>
<ul>
<li> what the main goals should be in current efforts to
advance biological knowledge via genome research,
</li>
<li> what role bioinformatics researchers have played in the
past, and
</li>
<li> what role they could productively play during the coming
few years.
</li>
</ul>
As such, it is undoubtedly an arrogant formulation by a group of
individuals with minimal background in
biology.
<p>The fourth section will focus on the implications of the
abstractions for software development.
This is a bit of a radical proposal that makes sense to us (and is in
an area that we can
legitimately claim expertise).
</p>
<h1>Part 1: The Abstractions</h1>
<h2>The cell: a Minimal Perspective</h2>
A <b>cell</b> is a bag (i.e., a volume enclosed by a
membrane) that contains three types of things: compounds, cellular
machines, and a genome.
<p>By the term <b>compound</b> we refer to the
normal notion of chemical compound. </p>
<p>A <b>cellular machine</b> is a set of proteins
that together perform a function. Unless otherwise noted,
when we use the term <i>machine</i> we will always be
speaking of a cellular machine.
Many machines
transform one set of compounds into another set. Some machines
(transport machines) are used to move compounds into
or out of the cell. Later we will try to convey a more comprehensive
notion of what functions are implemented
by machines that we understand.
</p>
<p>A <b>protein</b> is a string of amino acids
(i.e., a string in the 20-character alphabet
{A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}).
</p>
<p>A <b>genome</b> is a string of DNA bases (i.e., a
string in the 4-character alphabet {A,C,G,T}).
</p>
<p>A <b>gene</b> is a region in the genome that
describes how to build a
protein. The description is a sequence of 3-character codons. Each
codon either corresponds to a single amino acid or is a stop codon.
There are three stop codons: {TAA,TAG,TGA}. The genetic code is the
table of correspondences between codons and amino acids:
<br>
<br>
<table border="1">
<tbody>
<tr>
<th>Amino Acid</th>
<th>Codons</th>
</tr>
<tr>
<td>A</td>
<td>GCT, GCC, GCA, GCG </td>
</tr>
<tr>
<td>C</td>
<td>TGT, TGC</td>
</tr>
<tr>
<td>D</td>
<td>GAT, GAC</td>
</tr>
<tr>
<td>E</td>
<td>GAA, GAG</td>
</tr>
<tr>
<td>F</td>
<td>TTT, TTC</td>
</tr>
<tr>
<td>G</td>
<td>GGT, GGC, GGA, GGG</td>
</tr>
<tr>
<td>H</td>
<td>CAT, CAC</td>
</tr>
<tr>
<td>I</td>
<td>ATT, ATC, ATA</td>
</tr>
<tr>
<td>K</td>
<td>AAA, AAG</td>
</tr>
<tr>
<td>L</td>
<td>TTA, TTG, CTT, CTC, CTA, CTG</td>
</tr>
<tr>
<td>M</td>
<td>ATG</td>
</tr>
<tr>
<td>N</td>
<td>AAT, AAC</td>
</tr>
<tr>
<td>P</td>
<td>CCT, CCC, CCA, CCG</td>
</tr>
<tr>
<td>Q</td>
<td>CAA, CAG</td>
</tr>
<tr>
<td>R</td>
<td>CGT, CGC, CGA, CGG, AGA, AGG</td>
</tr>
<tr>
<td>S</td>
<td>TCT, TCC, TCA, TCG, AGT, AGC</td>
</tr>
<tr>
<td>T</td>
<td>ACT, ACC, ACA, ACG</td>
</tr>
<tr>
<td>V</td>
<td>GTT, GTC, GTA, GTG</td>
</tr>
<tr>
<td>W</td>
<td>TGG</td>
</tr>
<tr>
<td>Y</td>
<td>TAT, TAC</td>
</tr>
<tr>
<td>*</td>
<td>TAG, TGA, TAA [Stop codons]</td>
</tr>
</tbody>
</table>
<br>
<br>
</p>
<hr>The process of building a protein as a string of amino acids
from the gene containing codons is
called <b>expressing</b> the gene.
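<p>As a minimal sketch of what expressing a gene means in this basic
abstraction, the following Python fragment builds a codon lookup from the
table above and translates a gene codon by codon. The names
<i>GENETIC_CODE</i> and <i>express</i> are illustrative only and do not
come from any existing tool. The sketch assumes that the gene is given on
its coding strand, begins at its first codon, and contains only the
characters A, C, G, and T.</p>

```python
# Codon table copied from the genetic code table above; "*" marks a stop codon.
CODONS_BY_AMINO_ACID = {
    "A": "GCT GCC GCA GCG", "C": "TGT TGC", "D": "GAT GAC", "E": "GAA GAG",
    "F": "TTT TTC", "G": "GGT GGC GGA GGG", "H": "CAT CAC", "I": "ATT ATC ATA",
    "K": "AAA AAG", "L": "TTA TTG CTT CTC CTA CTG", "M": "ATG", "N": "AAT AAC",
    "P": "CCT CCC CCA CCG", "Q": "CAA CAG", "R": "CGT CGC CGA CGG AGA AGG",
    "S": "TCT TCC TCA TCG AGT AGC", "T": "ACT ACC ACA ACG",
    "V": "GTT GTC GTA GTG", "W": "TGG", "Y": "TAT TAC", "*": "TAG TGA TAA",
}

# Invert the table so that each codon maps to its one-letter amino acid code.
GENETIC_CODE = {
    codon: amino_acid
    for amino_acid, codons in CODONS_BY_AMINO_ACID.items()
    for codon in codons.split()
}


def express(gene: str) -> str:
    """Translate a gene codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(gene) - 2, 3):
        amino_acid = GENETIC_CODE[gene[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# First 60 characters of gene1 from the alignment example in these notes.
print(express("ATGGCTGATTTATTCGCATTGACCGAAGAAGCGTTGGCGGGCATGGGCATCGAGTTGGTC"))
# prints MADLFALTEEALAGMGIELV
```

<p>The printed string agrees with the start of the protein sequence shown
for gene1 in the protein alignment example in the tutorial notes.</p>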
<br>
A <b>subsystem</b> (i.e., an abstract cellular machine) is
a set of functional roles.
Each protein implements one or more functional roles. The set of
functional roles
implemented by the protein is called the <b>function of the
protein</b>. The function of a multifunctional
protein that implements {functional-role-1,functional-role-2} is
normally written as
<i>functional-role-1 / functional-role-2</i>.
<br>
<br>
A <b>populated subsystem</b> is a subsystem with an
attached spreadsheet. Each column
in the spreadsheet corresponds to a functional role in the subsystem,
and each row corresponds to
a specific genome. Each cell in the spreadsheet contains the genes from
the corresponding genome
that implement the designated functional role (there may be 0 or more
such genes).
<br>
<br>
We do not actually know what machines are present in a cell. We are in
the midst of a grand
effort to clarify which are there and what they do. The formulation of
subsystems as abstract machines
in which each row of the subsystem describes a specific cellular
machine that is believed to be present,
represents a way to maintain a collection of estimates or assertions.
<p>A <b>protein family</b> is defined to be a set of
proteins that implement the same functional roles and
are similar over the entire lengths of the proteins.
</p>
<p>We seek a situation in which each protein occurs in one or
more subsystems and in a single protein family.
</p>
<p>In any specific cell, sets of specific cellular machines are
switched on and off as units. That is, they are <i>co-regulated</i>.
We will call such a set
of <i>co-regulated cellular machines</i> a <b>regulon</b>
(note that a regulon is often a set containing
a single cellular machine). A <b>state</b> of a cell will
be defined
as the set of regulons that are operational at a point in time. Thus, a
state amounts to the set
of cellular machines that are operational at one instant.
</p>
<p>Microarrays are, for a given genome, two lists of genes that
"changed expression levels" between two states of a
cell. Basically, the first list contains genes that were "active" during
the first state, but not the second; and the
second list contains genes that were "active" in the second but not the
first. If a cellular
machine utilizes protein <i>X</i>, and <i>X</i>
is in the first list, and if <i>X</i> is used in
only one cellular machine, then it would be reasonable to infer that
the machine was
active in the first state, but not the second.
</p>
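<p>The inference rule just described can be sketched in a few lines of
Python. This is only an illustration under simplifying assumptions: the
machine names, the protein names, and the function name are all
hypothetical, and a gene is identified with the protein it encodes.</p>

```python
def machines_active_only_in_first_state(first_list, machine_proteins):
    """Infer machines that were active in the first state but not the second.

    first_list:       set of proteins whose genes appear in the first list.
    machine_proteins: machine name -> set of proteins the machine utilizes.
    """
    # Count how many machines utilize each protein.
    usage = {}
    for proteins in machine_proteins.values():
        for protein in proteins:
            usage[protein] = usage.get(protein, 0) + 1

    inferred = set()
    for machine, proteins in machine_proteins.items():
        # A protein supports the inference only if it is in the first list
        # and is used by this machine alone.
        if any(p in first_list and usage[p] == 1 for p in proteins):
            inferred.add(machine)
    return inferred


# Hypothetical data: protein Y is shared, so only X and Z are informative.
machines = {"machine-1": {"X", "Y"}, "machine-2": {"Y", "Z"}}
print(machines_active_only_in_first_state({"X", "Y"}, machines))
# prints {'machine-1'}
```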
<h2>The cell: the Enhanced Formulation Needed to Support
Bioinformatics</h2>
In the enhanced abstraction, we need to loosen up some concepts. In
particular,
<ul>
<li> A <b>genome</b> is a set of strings in a
4-character alphabet. Each of the strings
is called a <b>contig</b>. Note that the concept as
formulated covers both incomplete genomes and genomes with multiple
replicons.
</li>
<li>The genes within a genome are of two distinct types:
<ol>
<li>those that describe how to construct a protein (i.e.,
protein-encoding genes), and
</li>
<li>those that describe how to construct a string of RNA
(i.e., how to construct a string in the
4-character RNA alphabet {A,C,G,U}).
</li>
</ol>
<br>
<br>
</li>
<li>The location of a gene is generalized to be a set of
regions within the genome (that are
concatenated to form the instructions needed to construct either a
protein or a string of RNA).
</li>
<li>A protein is a string in an alphabet that now includes
the 20 character codes from
the basic abstraction plus a very limited set of extra codes. We
already have cases in which <i>selenocysteine</i> and <i>pyrrolysine</i>
appear as nonstandard translations of codons, and there may eventually
be more.
</li>
<li>Each protein-encoding gene has both a DNA sequence (by
definition) and a translation. However,
the translation is not required to exactly match what a codon-by-codon
translation of the DNA sequence
would produce. This allows us to handle the very rare instances in
which selenocysteine occurs as the translation
of TGA or pyrrolysine occurs as a translation of TAG (and others, if
necessary).
</li>
</ul>
This loosened up formulation represents a very minimal set of changes.
They should be left out of the
basic tutorial for computer scientists and mathematicians.
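<p>One possible way to represent the enhanced abstraction in code is
sketched below. It is an assumption-laden illustration, not a description
of any existing software: the class names <i>Region</i> and <i>Gene</i>,
the 0-based half-open coordinates, and the "+"/"-" strand labels are all
choices made for this example. A genome is a mapping from contig names to
strings, a gene location is a list of regions that are concatenated, and
the translation is stored with the gene rather than derived from the
DNA.</p>

```python
from dataclasses import dataclass
from typing import Dict, List

# Complement of each DNA base, used for regions on the reverse strand.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


@dataclass(frozen=True)
class Region:
    contig: str   # name of the contig that holds this region
    start: int    # 0-based position of the first character (inclusive)
    end: int      # 0-based position just past the last character (exclusive)
    strand: str   # "+" for the given strand, "-" for the reverse strand


@dataclass
class Gene:
    location: List[Region]  # regions concatenated in order
    translation: str = ""   # stored separately; may differ from the
                            # codon-by-codon translation of the DNA


def gene_dna(genome: Dict[str, str], gene: Gene) -> str:
    """Concatenate the regions of a gene into its DNA sequence."""
    pieces = []
    for region in gene.location:
        piece = genome[region.contig][region.start:region.end]
        if region.strand == "-":
            # Reverse-complement a region that lies on the other strand.
            piece = piece.translate(COMPLEMENT)[::-1]
        pieces.append(piece)
    return "".join(pieces)


# Hypothetical one-contig genome and a gene made of two regions.
genome = {"contig-1": "AAATGGCCTCATT"}
two_part_gene = Gene([Region("contig-1", 2, 5, "+"), Region("contig-1", 7, 10, "+")])
print(gene_dna(genome, two_part_gene))   # prints ATGCTC
```

<p>Keeping the translation as a stored field is what allows the rare
nonstandard translations to be recorded without changing the genetic code
table.</p>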
<h2>The cell: Adding the Concepts Needed to Discuss
Transcriptional Regulation</h2>
In the final version of the abstraction, we add the minimal set of
notions needed to support
analysis of transcriptional regulation. An <b>operon</b>
is a set of contiguous genes that are all on the same strand and are
all co-regulated. We consider a gene that is not co-regulated with any
adjacent genes
to be an operon composed of just itself. A <b>binding site</b>
is a small region of DNA (normally
occurring a short space ahead of an operon) that acts as a switch
turning the operon "on" or "off". When
a specific protein or expressed RNA called a <b>transcriptional
regulator</b> binds the site, it flips the switch. One or more
specific transcriptional regulators can bind a specific site (i.e.,
sets of sites are associated with each specific transcriptional
regulator). A regulator binding at a given site
always has the same effect (either activating or deactivating the
operon), but which effect depends on
the site-regulator pair.
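<p>A minimal sketch of the binding-site abstraction follows. The site,
regulator, and operon names are hypothetical, and the model is
deliberately crude: each binding event simply sets the operon
controlled by that site to "on" or "off" according to the
site-regulator pair.</p>

```python
# Which operon each binding site controls (hypothetical names).
SITE_OPERON = {"site-1": "operon-1", "site-2": "operon-2"}

# The effect is fixed for each (site, regulator) pair, but the same
# regulator may activate at one site and deactivate at another.
EFFECT = {
    ("site-1", "regulator-A"): "activate",
    ("site-1", "regulator-B"): "deactivate",
    ("site-2", "regulator-A"): "deactivate",
}


def operon_states(initial, binding_events):
    """Return operon -> True (on) / False (off) after the binding events.

    initial:        operon -> True/False before any regulator binds.
    binding_events: list of (site, regulator) pairs, applied in order.
    """
    states = dict(initial)
    for site, regulator in binding_events:
        states[SITE_OPERON[site]] = EFFECT[(site, regulator)] == "activate"
    return states


print(operon_states(
    {"operon-1": False, "operon-2": True},
    [("site-1", "regulator-A"), ("site-2", "regulator-A")],
))
# prints {'operon-1': True, 'operon-2': False}
```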
<h1>Part 2: Tutorial Notes</h1>
<h2>Notes for The Basic Abstraction</h2>
We will be speaking about organisms that are a single cell. At some
point life began on earth.
The single-celled organisms that we know of replicate producing copies
of themselves that have
genomes which usually have very, very similar content to that of the
parent cell. <b>Evolution</b> is the
process in which cells replicate with some alterations in their
genomes, are subjected to
<i>selective pressure</i>, and survive or not depending on
many somewhat random factors. The makeup of
cells (i.e., the genomes they contain and the machines that define what
they are capable of doing)
changes gradually (and sometimes not so gradually) as time passes.
<p>The original life forms that existed billions of years ago
have evolved into three broad categories of
life forms. That is, the evolutionary process led to early divisions,
and these led to three main
categories of single-celled organisms. We call these three forms the <b>archaea</b>,
the <b>bacteria</b>, and the <b>eukaryotes</b>.
A majority of the organisms for which we have acquired complete genomes
are from the bacteria, although the
numbers are rapidly growing for all three domains.
</p>
<p>The minimal notion of a cell is enough to explain some of the
basic
problems in bioinformatics:
</p>
<h3>Identify the genes within a genome</h3>
If we are to understand the contents of genomes, we will need to
locate the genes that occur in each genome. This problem simply
involves taking a genome (a
string of DNA) and locating the set of genes it contains. In the case
of bacteria and archaea, we know pretty well how to
locate the genes. Once we
have identified instances from many genomes, it becomes possible to
recognize the genes in a new genome by just looking for things similar
to those we already understand. The following problem is at the heart
of recognizing when two
genes are "similar".
<h3>Given two genes, "align" them in a way that minimizes some
edit function.</h3>
For example, here is what you see when you align two genes from
distinct organisms:
<pre>gene1 ATGGCTGATTTATTCGCATTGACCGAAGAAGCGTTGGCGGGCATGGGCATCGAGTTGGTC<br>gene2 ---GTGCAACTGACGGAACTGATAGAAACTACGGTCACGGGGCTCGGCTACGAGCTCGTC<br>* * * * * *** *** ** * **** * *** **** * ***<br>gene1 GATGTCGAACGTGCCGCCTTAGGCTTGTTGCGCGTGACCATAGACCGTGAGGACGGTGTT<br>gene2 GATCTCGAGCGCACCGGGCGCGGCATGGTCTGCGTCTACATCGATCAGCCCGCCGGCATC<br>*** **** ** *** *** ** * **** *** ** * * *** * gene1 CGCATCGAAGATTGTGAGCAGGTGTCCCGGCAATTGTCGCGCGTCTACGAGGTCGAGAAC<br>gene2 ACGATCGACGATTGCGAGAAGGTCACGCGTCAGCTCCAGCACGTACTGACGGTCGAAAAC<br>***** ***** *** **** * ** ** * ** *** ****** ***<br>gene1 ATCGATTACAAACGTCTGGAAGTTGGCTCGCCGGGCGTGGATCGCCCCTTGCGCAACGAG<br>gene2 ATCGATTACGAACGGCTCGAGGTCTCGTCACCGGGGCTCGACCGGCCGTTGAAGAAGCTG<br>********* **** ** ** ** ** ***** * ** ** ** *** ** *<br>gene1 GCGGAATTCCGTCGTTTCGCGGGTGAACGTATCGAGATCAAGCTGCGTGAGGCAGTCGAC<br>gene2 GCTGACTTCACGCGTTTCGCGGGCAGCGAGGCCGTCATCACCCTGAAAAAGCCGTTGGAC<br>** ** *** *********** ** **** *** ** * * ***<br>gene1 GGGCGCAAAGTGTTTACCGGCATCCTGCAAGAGGCGGACACGTCTGCTGACGATAAGACG<br>gene2 GGGCGCAAGACGTACCGGGGCATTCTGCACGCGCCGAAC------------GGCGAGACG<br>******** ** ***** ***** * * ** ** * *****<br>gene1 GTGTTCGGTCTCGAATTTGAGGCAAAGAAGGACGATATTCAGGTACTGAGCTTCACGCTC<br>gene2 AT---CGGTTTGGAATTTGAGAGGAAGAAGGGCGAGGCGGCCATGCTGGATTTCACGCTG<br>* **** * ********* ******* *** * *** ******** gene1 GATGACATCGAGCGCGCCAAGCTGGATCCCGTTCTGGATTTCAAGGGCAAAAAGCGATGA<br>gene2 GCGGACATCGACAAGGCCCGCCTGATTCCGCACGTTGACTTTAGGAGCCGCAAACAATGA<br>* ******** *** *** *** * ** ** * * ** ** * ****<br></pre>
<hr>
The sequences are recognizably similar, and in fact implement exactly
the same function
in the two cells. If we align the protein sequences corresponding to
these two
genes, we get
<pre>gene1 MADLFALTEEALAGMGIELVDVERAALGLLRVTIDREDGVRIEDCEQVSRQLSRVYEVEN<br>gene2 -VQLTELIETTVTGLGYELVDLERTGRGMVCVYIDQPAGITIDDCEKVTRQLQHVLTVEN<br> :* * * :::*:* ****:**:. *:: * **: *: *:***:*:***.:* ***<br><br>gene1 IDYKRLEVGSPGVDRPLRNEAEFRRFAGERIEIKLREAVDGRKVFTGILQEADTSADDKT<br>gene2 IDYERLEVSSPGLDRPLKKLADFTRFAGSEAVITLKKPLDGRKTYRGILHAPNG-----E<br> ***:****.***:****:: *:* ****.. *.*::.:****.: ***: .:<br><br>gene1 VFGLEFEAKKDDIQVLSFTLDDIERAKLDPVLDFKGKKR<br>gene2 TIGLEFERKKGEAAMLDFTLADIDKARLIPHVDFRSRKQ<br> .:***** **.: :*.*** **::*:* * :**:.:*:<br></pre>
There is a great deal of work relating to recognizing when two
sequences are
similar and whether or not they had a common ancestor. Understanding
why
selective pressure conserves sections of sequences, but not others,
will yield
important clues. Can you reason out why some sections might be
conserved, while
others vary wildly?
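<p>The pairwise alignment problem stated above can be sketched as a
standard dynamic program. The version below is a minimal illustration,
assuming the simplest possible edit function: each mismatch and each
inserted gap character costs 1 (both costs are parameters chosen for this
example). Tools used in practice rely on more elaborate scoring, so this
is a sketch of the idea rather than of any particular program.</p>

```python
def align(a: str, b: str, mismatch: int = 1, gap: int = 1):
    """Globally align a and b, minimizing mismatch and gap costs.

    Returns (total cost, aligned a, aligned b); "-" marks a gap.
    """
    n, m = len(a), len(b)
    # cost[i][j] is the cheapest way to align a[:i] with b[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = cost[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            cost[i][j] = min(substitute, cost[i - 1][j] + gap, cost[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (
            0 if a[i - 1] == b[j - 1] else mismatch
        ):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return cost[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


total, top, bottom = align("ATGGCTGAT", "ATGCTGAT")
print(total)    # prints 1
print(top)
print(bottom)
```

<p>When several alignments share the minimal cost, the traceback returns
just one of them; which one depends on the order in which the three moves
are tested.</p>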
<p>Comparing sets of sequences that have retained the same
function is
at the heart of understanding cellular machines and the proteins that
implement them. We find that looking at sets (often with more than two
sequences) and aligning them
is important.
</p>
<h3> Given a set of sequences, align them in a way that minimizes
some edit function.</h3>
Here is an example of a multiple sequence alignment:
<br>
<br>
<pre>CLUSTAL W (1.83) multiple sequence alignment<br><br><br>seq3 -------------------MRYISTRGQAPALNFEDVLLAGLASDGGLYVPENLPRFTLE<br>seq4 -------------------MRYISTRGSAPTLSFEEVLLTGLASDGGLYVPESLPSFTSA<br>seq5 -------------------MNYISTRGAIAPIGFKDAVMMGLATDGGLLLPETIPALGRN<br>seq1 -------------------MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLT<br>seq2 MKIRVICGAPTPKPFIKIPMKYYSTNKQAPLASLEEAVVKGLASDKGLFMPMTIKPLPQE<br> *. . . .: :.: **..: ** .* : :<br><br>seq3 EIASWVGLPYHELAFRVMRPFVAGSIADADFKKILEETYGVFAHDAVAPLRQLNGNEWVL<br>seq4 ELEAMASLDYPSLAHRILLPFVEEAFTGEELREIIDDTYAVFRHSAVAPLVQLDHNQWVL<br>seq5 TLESWQSLSYQDLAFNVIS-LFADDIPAQDLKDLIDRSYATFSHPEITPVVEKDG-VYIL<br>seq1 EIDEMLKLDFVTRSAKILSAFIGDEIPQEILEERVRAAFAFP-----APVANVESDVGCL<br>seq2 FYDEIENLSFREIAYRVADAFFGEDVPAETLKEIVYDTLNFD-----VPLVPVKENIYSL<br> * : : .: :. .. :.. : : .*: . *<br><br>seq3 ELFHGPTLAFKDFALQLLGRLLDHVLAKRGER-VVIMGATSGDTGSAAIEGCRRCDNVDI<br>seq4 ELFHGPTLAFKDFALQLLGRLLDAILKRRGEK-VVIMGATSGDTGSAAIAGCERCENIDI<br>seq5 ELFHGPTLAFKDVALQLLGNLFEYLLKERGEK-MNIVGATSGDTGSAAIYGVRGKDKINI<br>seq1 ELFHGPTLAFKDFGGRFMAQMLTHIA---GDKPVTILTATSGDTGAAVAHAFYGLPNVKV<br>seq2 ELFHGPTLAFKDVGGRFMARLLGYFIRKEGRKQVNVLVATSGDTGSAVANGFLGVEGIHV<br> ************.. :::..:: . * : : :: *******:*. . :.:<br><br>seq3 FIMHPHNRVSEVQRRQMTTILGDNIHNIAIEGNFDDCQEMVKASFADQGFLK-GTRLVAV<br>seq4 FILHPHGRVSEVQRRQMTTLSAPTIHNLAIEGNFDDCQAMVKASFRDQSFLPDGRRLVAV<br>seq5 FILHPHGKTSPVQALQMTTVLDPNVHNIAARGTFDDCQNIVKSLFSDLPFKE-KYSLGAV<br>seq1 VILYPRGKISPLQEKLFCTLGG-NIETVAIDGDFDACQALVKQAFDDEELKV-ALGLNSA<br>seq2 YVLYPKGKVSEIQEKQFTTLGR-NITALEVDGTFDDCQALVKAAFMDQELNE-QLLLTSA<br> :::*:.: * :* : *: .: : * ** ** :** * * : * :.<br><br>seq3 NSINWARIMAQIVYYFHAALQLG-APH-RSVAFSVPTGNFGDIFAGYLARNMGLPVSQLI<br>seq4 NSINWARIMAQIVYYFYAGLRLG-APH-RAAAYSVPTGNFGDIFAGYLASKMGLPVAQLM<br>seq5 NSINWARVLAQVVYYFYAYFRVA-ALFGQEVVFSVPTGNFGDIFAGYVAKRMGLPIRRLI<br>seq1 NSINISRLLAQICYYFEAVAQLPQETRNQ-LVVSVPSGNFGDLTAGLLAKSLGLPVKRFI<br>seq2 NSINVARFLPQAFYYFYAYAQLKKAGRAENVVICVPSGNFGNITAGLFGKKMGLPVRRFI<br> **** :*.:.* *** * :: . 
. .**:****:: ** .. :***: :::<br><br>seq3 VATNRNDILHRFMSGNRYDKDTLHPSLSPSMDIMVSSNFERLLFDLHGRNGKAVAELLDA<br>seq4 IATNRNDVLHRLLSTGDYARQTLEHTLSPSMDISVSSNFERLMFDLYERDGAAIASLMAA<br>seq5 LATNENNILSRFINGGDYSLGDVVATVSPSMDIQLASNFERYVYYLFGENPARVREAFAA<br>seq1 AATNVNDTVPRFLHDGQWSPKATQATLSNAMDVSQPNNWPR-VEELFR------------<br>seq2 AANNKNDIFYQYLQTGQYNPRPSVATIANAMDVGDPSNFAR-VLDLYGGS----------<br> *.* *: . : : . : ::: :**: ..*: * : *.<br><br>seq3 FKASGKLSVEDQRWTEARKLFDSLAVSDEQTCETIAEVYRSCGELLDPHTAIGVRAAREC<br>seq4 FDD-GDITLSDAAMEKARQLFASHRVDDAQTLACIADVWGRTEYLLDPHSAIGYAAATQP<br>seq5 LPTKGRIDFTEAEMEKVRDEFLSRSVNEDETIATIAAFHRETGYILDPHTAVGVKAALEL<br>seq1 -------------RKIWQLKELGYAAVDDETTQQTMRELKELGYTSEPHAAVAYRALRDQ<br>seq2 -------------HAAIAAEISGTTYTDEQIRESVKACWQQTGYLLDPHGACGYRALEEG<br> . : : :** * . * :<br><br>seq3 RRSLSVPMVTLGTAHPVKFPEAVEKAGIGQAPALPAHLADLFEREERCTVLPNELAKVQA<br>seq4 GANTQTPWVTLATAHPAKFPDAIKASAVGTTAQLPVHLADLFERSEHFDVLPNDIAAVQR<br>seq5 VQDG-TPAVCLATAHPAKFAEAVVR-AVGFEPSRPTSLEGIEALPSRCDVLDADRDAIKA<br>seq1 LNPG-EYGLFLGTAHPAKFKESVEA-ILGETLDLPKELAERADLPLLSHNLPADFAALRK<br>seq2 LQPG-ETGVFLETAHPAKFLQTVES-IIGTEVEIPAKLRAFMKGEKKSLPMTKEFADFKS<br> : * ****.** ::: :* * * : : .:<br><br>seq3 FVSQHGNRGKPL<br>seq4 FMSGHLGA----<br>seq5 FIEKKAL-----<br>seq1 LMMNHQ------<br>seq2 YLLGK-------<br> : :<br></pre>
<h3> Given a multiple sequence alignment, determine the most
likely evolutionary history of the sequences (i.e., construct a
phylogenetic tree).</h3>
From the extant five sequences that are similar and displayed in the
previous alignment, we can construct
a tree that depicts the "phylogenetic history" of the sequences.
Here is one reasonable tree for the last 5 sequences.
<pre>
                     ,--------------------------------------------------- seq1
                     |
                     |
  ,------------------|
  |                  |
  |                  |
  |                  `---------------------------------------------- seq2
  |
  |
  |
  |
  |
  |             ,-------------------------------- seq3
  |             |
  |             |
  |-------------|
  |             |
  |             |
  |             `------------------------------ seq4
  |
  |
  `---------------------------------------------- seq5
</pre>
The tree suggests that at some point an ancestral
cell replicated. One copy led (through a chain of descendants) to <b>seq5</b>,
while the remaining sequences descend from the other copy.
<p>Note that we now have alignments that
contain thousands of sequences, and even displaying such trees is
nontrivial.
Because evolution plays such a central role in the phenomena we study,
the construction of alignments
and trees in order to compare extant versions of proteins and gain
insight into their historical origins
is considered basic to the task at hand.
</p>
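<p>To make the tree-building problem concrete, here is a minimal sketch
of one simple method, UPGMA (average-linkage clustering), applied to the
rows of a multiple sequence alignment. These notes do not say which
method produced the tree shown above, so UPGMA is an assumption made for
illustration, as are the distance measure (the fraction of differing
columns) and the toy sequences. UPGMA implicitly assumes that all
lineages change at the same rate, which is often not true of real
sequences.</p>

```python
from itertools import combinations


def p_distance(row_a: str, row_b: str) -> float:
    """Fraction of columns, ignoring columns with a gap, where two rows differ."""
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def upgma(aligned: dict):
    """Build a tree from aligned rows (name -> row) as nested tuples."""
    # Each cluster is identified by its subtree; leaves are just names.
    size = {name: 1 for name in aligned}
    dist = {
        frozenset((a, b)): p_distance(aligned[a], aligned[b])
        for a, b in combinations(aligned, 2)
    }
    while len(size) > 1:
        # Join the two clusters that are currently closest.
        a, b = min(combinations(size, 2), key=lambda pair: dist[frozenset(pair)])
        merged = (a, b)
        # Distance from the new cluster to each remaining cluster is the
        # average over leaves, i.e. a size-weighted average of old distances.
        for c in size:
            if c not in (a, b):
                dist[frozenset((merged, c))] = (
                    size[a] * dist[frozenset((a, c))]
                    + size[b] * dist[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size))


# Toy alignment (hypothetical sequences, not the ones shown above).
toy = {"s1": "AAAA", "s2": "AAAT", "s3": "TTTT", "s4": "TTTA"}
print(upgma(toy))   # prints (('s1', 's2'), ('s3', 's4'))
```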
<h3>Some Random Facts that You Should Absorb</h3>
Most genomes of bacteria contain between 400,000 and 12,000,000
characters. Normally, the genes in a genome
cover about 90% of the genome.
Normally, there is about one gene per 1000 characters in a bacterial
genome.
<p>So, </p>
<ul>
<li> What is the length of the average protein sequence? </li>
<li>How many genes do these
genomes have? </li>
<li>What is the average length of a gene?
</li>
</ul>
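<p>One way to work through these questions is to treat the figures just
quoted as exact and do the arithmetic. The short sketch below does this;
the function name and the rounding choices are assumptions made for the
example, and the results are rough estimates only.</p>

```python
CHARS_PER_GENE = 1000   # about one gene per 1000 characters of genome
CODING_PERCENT = 90     # genes cover about 90% of the genome


def estimates(genome_length: int) -> dict:
    """Rough gene count and average lengths for a bacterial genome."""
    genes = genome_length // CHARS_PER_GENE
    coding_characters = genome_length * CODING_PERCENT // 100
    gene_length = coding_characters // genes
    # Three characters per codon; the final stop codon adds no amino acid.
    protein_length = gene_length // 3 - 1
    return {
        "genes": genes,
        "average_gene_length": gene_length,
        "average_protein_length": protein_length,
    }


for length in (400_000, 12_000_000):
    print(length, estimates(length))
# prints
# 400000 {'genes': 400, 'average_gene_length': 900, 'average_protein_length': 299}
# 12000000 {'genes': 12000, 'average_gene_length': 900, 'average_protein_length': 299}
```

<p>So, under these figures, the genomes hold roughly 400 to 12,000 genes,
an average gene is about 900 characters long, and an average protein is
about 300 amino acids long.</p>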
<br>
It is worth spending just a short bit of time thinking about what types
of
machines must exist in each cell. Here are a few thoughts to start with:
<ul>
<li>
There must be one or more machines that support replication of the
cell. You would
need something to copy the genome, and you would need something that
could build the DNA
bases that represent the characters (i.e., you will need machines to
build the molecules
corresponding to each of the four characters in the alphabet of DNA
bases).
</li>
<li>As we mentioned, you have transport machines that take
things into and out of the cell. Many
cells can import food in the form of sugar molecules. For example, many
cells can import
<i>glucose</i>, a six-carbon compound. As the compound
gets broken down into smaller compounds,
energy is salvaged from the broken bonds to power the machines in the
cell. The smaller compounds
are used as building blocks for other needs.
</li>
<li>There must be one or more machines involved in building
proteins from the descriptions in the genes.
In particular, we will need a machine for each of the amino acids
(unless the cell can import some
of them).
</li>
<li>There must be mechanisms for sensing what is going on in
the environment and allowing the cell
to react to it. For example, many cells can "swim" towards food.
</li>
</ul>
Those were just a few examples. For any cell, we have many, many
machines, and we still
do not even understand what some of them do. Later, we will try to
offer a more structured
estimate of what is already known.
<p>About 50-60% of the genes occur within 5000 characters of
another gene such that
the two genes encode proteins that are part of the same cellular
machine. This fact suggests that just having a large number of genomes
would enable a person to group
the genes into the machines they implement, without the person
understanding the functions
of the machines or the roles played by each protein.
</p>
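<p>The suggestion that many genomes would let one group genes by
proximity alone can be sketched as follows. The sketch assumes that genes
from different genomes have already been labeled with some shared
identifier (called a "family" here, for example from sequence
similarity); that labeling, the data layout, the way the gap between two
genes is measured, and the threshold of two supporting genomes are all
assumptions made for illustration.</p>

```python
from collections import Counter
from itertools import combinations


def conserved_neighbors(genomes, max_gap=5000, min_genomes=2):
    """Find pairs of families whose genes lie close together in many genomes.

    genomes: genome name -> list of (family, contig, start, end) tuples.
    Returns the set of family pairs seen within max_gap characters of each
    other, on the same contig, in at least min_genomes genomes.
    """
    support = Counter()
    for genes in genomes.values():
        close_pairs = set()
        for (fam_g, contig_g, start_g, end_g), (fam_h, contig_h, start_h, end_h) in combinations(genes, 2):
            if contig_g != contig_h or fam_g == fam_h:
                continue
            # Characters between the two genes (negative if they overlap).
            gap = max(start_g, start_h) - min(end_g, end_h)
            if gap <= max_gap:
                close_pairs.add(frozenset((fam_g, fam_h)))
        # Count each pair at most once per genome.
        support.update(close_pairs)
    return {pair for pair, count in support.items() if count >= min_genomes}


# Hypothetical genomes: famA and famB are neighbors in two of them.
genomes = {
    "genome-1": [("famA", "c1", 100, 1000), ("famB", "c1", 1200, 2000), ("famC", "c1", 50000, 51000)],
    "genome-2": [("famA", "c1", 7000, 8000), ("famB", "c1", 8100, 9000), ("famC", "c2", 100, 900)],
    "genome-3": [("famA", "c1", 0, 900), ("famC", "c1", 1000, 1900)],
}
print(conserved_neighbors(genomes))
```

<p>With this toy data, only the pair famA and famB is reported, because
famA and famC are neighbors in just one genome.</p>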
<p>Occasionally, proteins that are usually distinct in most cells
are fused into a single protein in
a few cells. In these cases, the fused gene is (by definition) part of
a single machine, and
in most cells in which the proteins are not fused, the two distinct
proteins are separate components
of a single machine. This, too, offers clues to support analysis of
which proteins go with which machines.
</p>
<p>Biologists have figured out the roles of about 50% of the
genes. That is, they can
place the gene in a cellular machine, they know what the machine does,
and they know
the specific role of the gene in sustaining the functionality of the
machine.
</p>
<h3>Imposing a Structure on Characterizing the Inventory</h3>
<p>One central goal of bioinformatics is to support an accurate
characterization of the cellular
machinery for each cell. It is of major importance to biologists that we
be able to support
comparative analysis of cells. Perhaps the most important aspect of
understanding cells relates to
their origin in an evolutionary process. Cells have a long evolutionary
history dating back billions of
years. The machines we see in cells today arose in the past, so we
expect to see many current cells
using machinery that resembles what turns up in other cells. When we
compare machines from different
cells, they often look remarkably similar. On the other hand, those that
had a common origin in a cell that existed billions of years in the
past may now have versions that are not very similar. Modifications,
optimizations,
and insignificant alterations all combine to explore the space of
operational possibilities for
each type of machine. Hence, we need a framework for studying
similarities and differences in the
cellular machines and the proteins that implement them.
</p>
<p>Here is a short formulation of one way to do this:
<br>
<br>
</p>
<ul>
<li>A <b>subsystem</b> (i.e., an abstract cellular
machine) is a set of functional roles.
</li>
<li>Each protein implements one or more functional roles. The
set of functional roles
implemented by the protein is called the <b>function of the
protein</b>. The function of a multifunctional
protein that implements {functional-role-1,functional-role-2} is
normally written as
<i>functional-role-1 / functional-role-2</i>.
<br>
<br>
</li>
<li>A <b>populated subsystem</b> is a subsystem
with an attached spreadsheet. Each column
in the spreadsheet corresponds to a functional role in the subsystem,
and each row corresponds to
a specific genome. Each cell in the spreadsheet contains the genes from
the corresponding genome
that implement the designated functional role (there may be 0 or more
such genes).
</li>
</ul>
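<p>The three definitions above map naturally onto a small data structure.
The following sketch is only one possible encoding, with hypothetical
genome and gene names: the subsystem is a list of functional roles, the
spreadsheet is a nested dictionary with one row per genome and one column
per role, and a gene stands in for the protein it encodes.</p>

```python
# A subsystem is a set of functional roles (kept as a list to fix column order).
ROLES = ["functional-role-1", "functional-role-2", "functional-role-3"]

# A populated subsystem: one row per genome, one column per functional role.
# Each spreadsheet cell holds the genes (0 or more) that implement the role.
SPREADSHEET = {
    "genome-A": {
        "functional-role-1": ["geneA1"],
        "functional-role-2": ["geneA1"],           # geneA1 is multifunctional
        "functional-role-3": ["geneA7", "geneA9"],
    },
    "genome-B": {
        "functional-role-1": ["geneB4"],
        "functional-role-2": [],                   # no gene found for this role
        "functional-role-3": ["geneB2"],
    },
}


def function_of(gene: str, genome: str) -> str:
    """Write the function of a gene's protein as 'role-1 / role-2'."""
    roles = [role for role in ROLES if gene in SPREADSHEET[genome].get(role, [])]
    return " / ".join(roles)


def missing_roles(genome: str) -> list:
    """List the roles of the subsystem with no gene in the given genome."""
    return [role for role in ROLES if not SPREADSHEET[genome].get(role)]


print(function_of("geneA1", "genome-A"))   # prints functional-role-1 / functional-role-2
print(missing_roles("genome-B"))           # prints ['functional-role-2']
```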
<br>
<br>
We do not actually know what machines are present in a cell. We are in
the midst of a grand
effort to clarify which are there and what they do. The formulation of
subsystems as abstract machines
in which each row of the subsystem describes a specific cellular
machine that is believed to be present,
represents a way to maintain a collection of estimates or assertions.
<p>A <b>protein family</b> is defined to be a set of
proteins that implement the same functional roles and
are similar over the entire lengths of the proteins.
</p>
<p>We seek a situation in which each protein occurs in one or
more subsystems and in a single protein family.
The computational tasks imposed by such a goal are obvious:
</p>
<ul>
<li>We need to construct databases that implement at least the
following entities:
<ol>
<li>cells (i.e., each cell must have an ID and a set of
attributes),
</li>
<li>genomes,
</li>
<li>genes,
</li>
<li>proteins,
</li>
<li>functional roles,
</li>
<li>subsystems, and
</li>
<li>protein families.
</li>
</ol>
</li>
<li> We need to add support for developing clues to function by
integrating data
from sources like proximity within the genome, fusions, etc.
</li>
<li>We need to support a framework for the development of
populated subsystems.
</li>
<li>We need to construct decision procedures for membership in
protein families. Some of these procedures will be quite complex,
although the majority of cases can be
handled by fairly general procedures.
</li>
</ul>
<h3>States of the Cell</h3>
The notion of <i>subsystem</i> was introduced as an <i>abstract
machine</i> -- that is, as an
attempt to create a framework for understanding variations within
specific cellular machines via
a form of comparative analysis. In any specific cell, sets of specific
cellular machines are switched on and off as units. That is, they are <i>co-regulated</i>.
We will call such a set
of <i>co-regulated cellular machines</i> a <b>regulon</b>
(note that a regulon is often a set containing
a single cellular machine). A <b>state</b> of a cell will
be defined
as the set of regulons that are operational at a point in time. Thus, a
state amounts to the set
of cellular machines that are operational at one instant.
<p>If we think of a car as a bag of machines that interact to
make it function, we might consider there
to be a huge number of states. There are many very minor "machines"
like the arm rest (or the radio, or the night light) that can be on or
off. However, we can divide the states of a car into major groupings
based on the status
of some key "machines". For example, "off" (the state in which the
engine is turned off and the car is parked) and
"on" (the engine is running and the car is moving) might be viewed as a
crude partitioning of the states into
two "major states".
</p>
<p>Similarly, I believe that we should think about <i>major
states of the cell</i> as being determined by the functioning (or
not) of a limited set of regulons. The determination of these regulons,
the major states,
and how transitions between them are managed are all now parts of the
picture being filled in.
</p>
<h3>Microarrays</h3>
Microarrays are, for a given genome, two lists of genes that "changed
expression levels" between two states of a
cell. Basically, the first list contains genes that were "active" during
the first state, but not the second; and the
second list contains genes that were "active" in the second but not the
first. If a cellular
machine utilizes protein <i>X</i>, and <i>X</i>
is in the first list, and if <i>X</i> is used in
only one cellular machine, then it would be reasonable to infer that
the machine was
active in the first state, but not the second. If one knew the regulons
for a specific cell, it would go
a long way toward supporting extraction of insights from these microarrays. On
the other hand, if one had many,
many microarrays, and if the specific cellular machines for the cell
are known, then one could make
substantial progress in uncovering the exact composition of the
regulons that make up the cell.<br>
<br>
We are just now reaching the point where we do, in fact, have hundreds
of microarrays (each representing changes between two sampled states of
the cell). &nbsp;<br>
Let us reflect on how one might use this data to uncover the regulons
that are represented and how they relate to the major "states of the
cell".<br>
<br>
We might begin by trying to determine sets of genes from each subsystem
that appear to "move together". &nbsp; More precisely, we want to arrive
at a set of genes that perform a well-defined function, some subset of
which almost always shows up in the microarrays as "moving together".
&nbsp;Of these, if we have genes that occur only in a single
subsystem, then it would be reasonable to think of them as <span style="font-style: italic;">signatures</span> for the set
of genes. &nbsp;The most natural way to do this would be to start
with metabolic subsystems, or even better <span style="font-style: italic;">scenarios</span> (discussed
below), which are subsets of functional roles from a metabolic subsystem
such that the subset is a connected set with well-defined inputs and
outputs. &nbsp;We wish then to define discovery of the regulon sets
associated with each condition as follows:<br>
<br>
<ol>
<li>&nbsp;First, for each scenario define&nbsp;</li>
<ul>
<li>the set of genes that are expected to show up in a
microarray when the scenario is activated or deactivated (call this
"the set of genes that move together" = <span style="font-style: italic;">SGMT</span> for the scenario),</li>
<br>
<li>the subset of genes (perhaps empty) of the SGMT that are <span style="font-style: italic;">signatures</span> (call
this the <span style="font-style: italic;">signatures of the
scenario</span>)</li>
</ul>
<br>
<li>Then define the <span style="font-style: italic;">set
of regulons</span>. &nbsp;Each regulon is a set of
scenarios. &nbsp;There is a cost <span style="font-weight: bold;">cost_reg</span> associated
with the definition of each regulon. &nbsp;This penalty discourages the
definition of numerous regulons all containing just one scenario.
&nbsp;If the penalty is set too high, only one regulon will be
defined. &nbsp;If it is set too low, then a large set of small
regulons results.</li>
<br>
<li>Finally, you need to define the set of regulons that were
activated for each microarray and the set that were deactivated.</li>
<br>
<li>Now, you compute a score for your decisions as&nbsp;<span style="font-weight: bold;">score = P - M - (cost_reg *
number_of_defined_regulons * number_of_microarrays)</span> where</li>
<br>
<ul>
<li><span style="font-weight: bold;">P</span>
= <span style="font-weight: bold;">p1 + p2</span>,
where</li>
<br>
<ul>
<li><span style="font-weight: bold;">p1</span>
= <span style="font-weight: bold;">a1 * value_signature</span>, where
<span style="font-weight: bold;">a1</span>
is the number of signatures of scenarios that moved as predicted and <span style="font-weight: bold;">value_signature</span> is
the value associated with a signature moving in the direction predicted,</li>
<br>
<li><span style="font-weight: bold;">p2 = a2 *
value_SGMT_nonsig</span>, where <span style="font-weight: bold;">a2</span>
is the number of non-signature SGMT genes that moved as predicted and <span style="font-weight: bold;">value_SGMT_nonsig</span> is
the value associated with a non-signature SGMT gene moving in the
direction predicted, and</li>
<br>
</ul>
<li><span style="font-weight: bold;">M = m1 +
m2, where</span></li>
<br>
<ul>
<li><span style="font-weight: bold;">m1 = b1 *
value_signature</span>, where <span style="font-weight: bold;">b1</span>
is the number of signatures of scenarios that did not move as
predicted, &nbsp;and</li>
<br>
<li><span style="font-weight: bold;">m2 = b2 *
value_SGMT_nonsig</span>, where <span style="font-weight: bold;">b2</span>
is the number of non-signature SGMT genes that did not move as predicted.&nbsp;</li>
</ul>
<br>
The&nbsp;<span style="font-weight: bold;">score</span> reflects
how well your decisions in the first three steps match the data in the
microarrays. &nbsp;The object is to make the sets of decisions in
the first three steps in a way that maximizes the&nbsp;<span style="font-weight: bold;">score</span>.<br>
<br>
</ul>
</ol>
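<p>The scoring rule above can be sketched as a short Python function. The data layout is an assumption made for illustration: each microarray is represented as two gene sets (here called <code>turned_on</code> and <code>turned_off</code>), an activated regulon is taken to predict that every SGMT gene of its scenarios appears in <code>turned_on</code>, a deactivated regulon predicts <code>turned_off</code>, and <b>a2</b> and <b>b2</b> count only non-signature SGMT genes. All scenario, regulon, and gene names are hypothetical.</p>

```python
def score_decisions(scenarios, regulons, microarrays, calls,
                    value_signature=2.0, value_sgmt_nonsig=1.0, cost_reg=0.5):
    """score = P - M - cost_reg * number_of_regulons * number_of_microarrays.

    scenarios:   dict name -> {"sgmt": set of genes, "signatures": subset of sgmt}
    regulons:    dict name -> set of scenario names
    microarrays: list of {"turned_on": set of genes, "turned_off": set of genes}
    calls:       one {"activated": set, "deactivated": set} of regulon names
                 per microarray (the decisions made in the third step)
    """
    a1 = a2 = b1 = b2 = 0
    for array, call in zip(microarrays, calls):
        # Pair each kind of decision with the gene list it predicts.
        for decision, observed in (("activated", array["turned_on"]),
                                   ("deactivated", array["turned_off"])):
            for regulon in call[decision]:
                for scenario_name in regulons[regulon]:
                    scenario = scenarios[scenario_name]
                    for gene in scenario["sgmt"]:
                        is_signature = gene in scenario["signatures"]
                        moved = gene in observed
                        if is_signature and moved:
                            a1 += 1
                        elif is_signature:
                            b1 += 1
                        elif moved:
                            a2 += 1
                        else:
                            b2 += 1
    P = a1 * value_signature + a2 * value_sgmt_nonsig
    M = b1 * value_signature + b2 * value_sgmt_nonsig
    return P - M - cost_reg * len(regulons) * len(microarrays)

# Hypothetical toy data: two scenarios, two regulons, one microarray.
scenarios = {
    "s1": {"sgmt": {"g1", "g2", "g3"}, "signatures": {"g1"}},
    "s2": {"sgmt": {"g4", "g5"}, "signatures": {"g4"}},
}
regulons = {"r1": {"s1"}, "r2": {"s2"}}
microarrays = [{"turned_on": {"g1", "g2"}, "turned_off": {"g4"}}]
calls = [{"activated": {"r1"}, "deactivated": {"r2"}}]
score = score_decisions(scenarios, regulons, microarrays, calls)
print(score)
```

<p>A search procedure would then vary the definitions of SGMTs, signatures, regulons, and per-microarray calls to maximize this value; the sketch only evaluates one fixed set of decisions. Note that, as in the formula in the text, genes that moved without being predicted are not penalized.</p>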


<h2>Notes for the Enhanced Abstraction</h2>
The process of <b>expressing a gene</b> amounts to using
the gene to produce the functional component of
a machine (a protein for a protein-encoding gene, and an RNA for an
RNA-encoding gene).
The process of expressing a protein-encoding gene, which takes a gene (a
string of DNA formed by concatenating a sequence of
regions from contigs) and produces a protein, is normally thought of as
taking place in two steps.
<b>Transcription</b> is the process of a specific machine
moving along the contig and making a copy of the
gene as RNA. This string of RNA is then <b>translated</b>
by a separate machine. The machine that performs
the copying of the gene into a string of RNA is called an <b>RNA
polymerase</b>. The machine that translates
the RNA into a protein, the <b>ribosome</b>, is made up of
both proteins and RNA components.
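<p>For readers who think in code, the two steps can be sketched as string operations. This is a deliberately simplified illustration: it treats the gene as the coding strand, uses only a handful of codons from the standard genetic code, and the function names (<code>transcribe</code>, <code>translate</code>, <code>express</code>) and the example regions are made up for this sketch.</p>

```python
# A few entries of the standard genetic code; a full table has 64 codons.
CODON_TABLE = {
    "AUG": "M", "UUU": "F", "GGC": "G", "AAA": "K", "UGG": "W",
    "UAA": "*", "UAG": "*", "UGA": "*",  # "*" marks a stop codon
}

def transcribe(gene_dna):
    """Copy the coding strand of a gene into RNA (T becomes U)."""
    return gene_dna.upper().replace("T", "U")

def translate(rna):
    """Read the RNA three characters at a time until a stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        codon = rna[i:i + 3]
        # "X" stands in for any codon missing from this partial table.
        amino_acid = CODON_TABLE.get(codon, "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

def express(regions):
    """A gene is the concatenation of regions taken from contigs."""
    gene = "".join(regions)
    return translate(transcribe(gene))

protein = express(["ATGTTT", "GGCAAA", "TGGTAA"])
print(protein)
```

<p>The sketch ignores everything that makes the real machines interesting (strand choice, start-site selection, alternative genetic codes), but it makes the abstraction concrete: a gene is a string, and expression is a pair of functions applied to it.</p>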
<p>Machines can be made up of both protein and RNA components,
although most machines are built from
just proteins. Some of the most fundamental questions in biology relate
to how life started and the steps
required to gradually enrich the basic machinery to the point where
this magnificent information storage and
maintenance system based on DNA, RNA and proteins could have arisen.
There is much that can be inferred by
reasoning back from what we now observe and reasoning forward from the
relatively little we know of what the early earth was like. One
possible set of goals would be first to understand in detail the
inventory
of components we now see in life forms, composing something analogous
to a CAD/CAM system describing life forms,
and then, as a second step, to understand the sequence of transformations
that led from some initial raw components
to initial life forms to those we have seen and characterized.
</p>
<p>The need to allow occasional "nonstandard" characters in
protein sequences, and to loosen the correspondence
between a gene and the characters in the protein sequence it can be used to
build, results from the fact that
evolution has produced the existing genetic codes and that they continue to
evolve (either converging or diverging
depending on the outcome of basically random processes operating under
selective pressure).
<br>
</p>
<h2>Notes on the Abstraction Extended to Support Regulation</h2>
There are two fundamentally different regulatory mechanisms in the cell. In
one, you have a metabolic
network in which fluxes are tightly controlled by positive and negative
feedback loops. This <b>metabolic
regulation</b> occurs very rapidly. <b>Transcriptional
regulation</b> occurs orders of magnitude more slowly. It is just
this transcriptional regulation that we consider in this extension.
<p>As the cell changes state, regulons are activated or
de-activated by
transcriptional regulators (either protein or RNA) binding to specific
sites in the DNA. This model has the redeeming characteristic of
simplicity. It certainly disregards innumerable
important issues (e.g., regulation based on DNA
packaging, regulation due to small RNAs binding the RNAs produced by
transcription, etc.). In forming any clear notion of transcriptional
regulation and how it is achieved, we will need to carefully separate
these different mechanisms, since they have fundamentally different
modes of control and operation. We are arguing that the notion of a
protein or RNA being used to flip regulons on and off by binding to
control sites within the genome is a major form of regulation and
probably the right place to start any effort to formulate a useful
abstraction.
</p>
<h1>The Role of Bioinformatics in Supporting the Genomic
Revolution</h1>
Within the growing genomics revolution, one can easily divide
developments and
goals into those relating to advances in medicine and agriculture and
those relating to
pure science. Here we consider only issues relating to advancing
basic research.
Here is an overview of our perspective:
<ol>
<li> The different life forms that now exist were produced by
an evolutionary process,
which leads to our view that comparative analysis is the key to
understanding. Biological
machines that exist in complex forms will often also still exist in
simpler forms (usually
in simpler organisms).
</li>
<li> Unravelling exactly how a machine works is more easily
done in simpler organisms. They
are easier to work with, and it is easier to gather the data needed to
support comparative analysis.
</li>
<li> This leads to the view that we should try to understand
single-celled organisms to lay
the foundation for analysis of multicellular organisms.
</li>
<li> The characterization of unicellular life will require
access to orders of magnitude
more data than exist now (we have more-or-less complete sequences for
about 1000 genomes, but
that represents a small fraction of a percent of extant single-celled
life forms).
</li>
<li> The immediate basic steps that are taking place are
roughly:
<br>
<br>
<ol>
<li> Attempt to formulate a growing list of abstract
machines that correspond
to the many specific machines that implement the same goal. These
abstract machines (subsystems)
represent the basic units that make up life forms.
</li>
<li> Create protein and RNA families in which the members
are all homologous (share a common ancestor),
remain similar over almost all of the sequence, and all implement a
common function.
</li>
<li> Build alignments for each protein family, along with
phylogenetic trees that represent
an estimate of the history of how these specific sequences evolved.
</li>
<li>Provide a computational framework to support continued
maintenance and development of these
basic data types.</li>
</ol>
<br>
Groups are now actively pursuing all of these goals. &nbsp;For
individuals wishing to build a research program, we suggest
collaborating with an existing group or moving to one of the newer
areas that are now emerging.
</li>
<br>
<li> A limited number of groups have progressed to the point
where they can create models of an organism that display predictive
capabilities. There are many forms of modeling. In our view
it is important that we reach the state where we can routinely model
states of the cell, transitions
between states, and metabolic characteristics of the cell. We believe
that it is now possible
to create fairly comprehensive representations of the metabolic
networks of some bacteria. In these cases, we have substantial amounts
of physiological data, the number of abstract machines
in the cell is fairly limited, and it is possible to compare the
predictions against observed results. &nbsp;A
team within the SEED project, led by researchers from Hope College, has
begun an effort to
develop a library of what they call&nbsp;<span style="font-style: italic;">scenarios</span>.
&nbsp;These scenarios capture the idea of a specific machine
implementing a metabolic transformation operating with well-defined
inputs and outputs. From a large and growing number of scenarios in
this library, they automatically reconstruct metabolic networks for
most of the bacteria for which genomes have been sequenced.
&nbsp;This
effort is setting the stage for widespread whole-genome metabolic
modeling.&nbsp;</li>
<li>Rapid progress has been made in our ability to
recognize regulatory binding sites and to use them with knowledge of
specific machines to create a consistent picture of regulons in some
bacteria. &nbsp;This technology has been gathering adherents over
the
last five years and we believe that it will play a significant role in
clarifying regulons, identifying additional proteins that belong to specific
machines, and building a growing understanding of states of the cell.&nbsp;</li>
</ol>
Having said all that, is it possible to list some of the
important, high-payout bioinformatic questions that are worth
pondering? &nbsp;Here is a list for your consideration:<br>
<br>
<ol>
<li>The
definition of the location of genes for bacterial genomes
needs cleaning up. &nbsp;The situation is made somewhat more
interesting by a growing use of sequencing technologies that produce
systematic errors leading to numerous frameshifts and poorly called
start locations. &nbsp;Fixing these would be a problem of modest
difficulty and very modest reward. &nbsp;The situation in
eukaryotic
genomes is quite different. &nbsp;The problem of defining the genes
in
a eukaryotic genome is still largely unsolved. &nbsp;We conjecture
that</li>
<ul>
<li>the
key to progress is the use of sets of genomes (i.e., solve the problem
of defining the genes in a set of closely-related genomes first), and</li>
<li>it is best to begin
with the single-celled eukaryotic genomes. &nbsp;There are
many
types of single-celled eukaryotes, and some of them will undoubtedly
offer major challenges. &nbsp;However, existing experience suggests
that there will be numerous <span style="font-style: italic;">fungal</span>
genomes available (for example) and that focusing on these would be a
much easier task than trying to face plants, animals, etc.</li>
</ul>
<li>The
creation of populated subsystems is essentially a task for expert
biologists. &nbsp;However, the tools to support the task are a
reasonable focus for bioinformatic projects. &nbsp;The tools needed
to
delicately separate the roles of paralogous proteins have been
illustrated in the works of Jensen and Bonner, among others.
&nbsp;These tools relate to use of alignments, trees and motifs to
define the decision procedures needed to classify proteins into one of
several closely-related families.</li>
<li>The development of a
self-consistent set of protein families is a task closely related to
the one above. &nbsp;Several major
efforts are currently building such protein families. &nbsp;The
development
of protocols for maintenance of the families, the study of the evolutionary
history of related families, the development of motifs that characterize
specific families, and so forth all represent parts of a large
classification problem.</li>
<li>There is a class of tools that attempt to spot <span style="font-style: italic;">functional coupling</span>
between specific proteins. &nbsp;Some are bioinformatic (like the
chromosomal clustering and fusion phenomena briefly discussed above).
&nbsp;Some rely on essentially experimental data (e.g., protein-protein
interaction data or microarray data). &nbsp;The integration of
evidence
into a system capable of predicting whether or not two
specific
proteins are both components of a single machine has been attempted,
but much more remains to be done. &nbsp;The closely-related problem
of
determining whether or not two protein families are <span style="font-style: italic;">functionally coupled</span>
(and precisely what that means) should be considered simultaneously.</li>
<li>Defining
regulons by gradually composing a consistent interpretation of
subsystems, regulatory sites, and physiological data is a task that is
semi-automated. &nbsp;Development of a fully automated version seems
too
ambitious, but developing tools to increase the productivity of
biologists building these models of transcriptional regulation is
certain to gain much more attention.</li>
<li>Development of a meaningful notion of <span style="font-style: italic;">states of a cell</span>
is a problem that seems to us to have many of the characteristics one wants:
&nbsp;it is a problem for which relevant data is starting to
appear,
many aspects of the needed infrastructure have only recently appeared,
and the outcome may be of fundamental significance.</li>
<li>To what
extent is it possible to predict the protein families which have
instances in a given cell given the closest 10 neighboring genomes and
detailed information on the families they contain?</li>
<li>Is it possible to think of a set of protein families as <span style="font-style: italic;">major predictors</span>
that would allow you to infer the presence or absence of many other
families?</li>
</ol>
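<p>To make the question about predicting protein families from neighboring genomes concrete, here is one naive baseline, offered only as a sketch: predict that a family is present if at least a chosen fraction of the neighboring genomes contain it. The family identifiers, the threshold, and the function names are assumptions for illustration; nothing here is claimed to be how such predictions are or should be made.</p>

```python
from collections import Counter

def predict_families(neighbor_families, threshold=0.5):
    """Predict families by majority vote over neighboring genomes.

    neighbor_families: list of sets of family ids, one set per neighbor
                       (e.g. the closest 10 genomes).
    threshold: minimum fraction of neighbors that must contain a family.
    """
    if not neighbor_families:
        return set()
    counts = Counter()
    for families in neighbor_families:
        counts.update(families)
    n = len(neighbor_families)
    return {family for family, c in counts.items() if c / n >= threshold}

def precision_recall(predicted, actual):
    """Compare a predicted family set with the families actually present."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical family ids for three neighbors and the target genome.
neighbors = [{"F1", "F2"}, {"F1", "F3"}, {"F1", "F2", "F4"}]
predicted = predict_families(neighbors)
precision, recall = precision_recall(predicted, {"F1", "F2", "F5"})
print(sorted(predicted), precision, recall)
```

<p>Measuring precision and recall against genomes whose families are already known gives one way to quantify "to what extent" the prediction is possible, and the same harness could be reused to test candidate sets of major predictors.</p>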
<br>
<h1> The Role of Abstraction in Setting the Stage for Software
Development and Modeling</h1>
In
this section, we argue that the abstraction is much more than just a
pedagogical aid. &nbsp;It will form the conceptual underpinnings
of
the software needed to support work on the problems described in the
last section (as well as numerous others that will become apparent as
the revolution progresses).<br>
</body></html>
