Revision 1.6
Sun Mar 29 22:02:58 2009 UTC (10 years, 7 months ago) by overbeek
an update to my abstract tutorial

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head><title>Abstraction Working Document</title></head>
<body>
<div align="center">
<h1>Understanding Single-celled Life:</h1>
<h1>An Abstract Approach</h1>
<h2>by Ralph Butler, Ross Overbeek, ...</h2>
</div>
<br>
<h1>Part 1: The Cell: a Basic Abstraction</h1>
A <b>cell</b> is a bag (i.e., a volume enclosed by a
membrane) that contains three types of things: compounds, cellular
machines, and a genome.
<p>By the term <b>compound</b> we refer to the
normal notion of chemical compound. </p>
<p>A <b>cellular machine</b> is a set of proteins
that together perform a function. Unless otherwise noted,
when we use the term <i>machine</i> we will always be
speaking of a cellular machine.
Many machines
transform one set of compounds into another set. Some machines
(transport machines) are used to move compounds into
or out of the cell. Later we will try to convey a more comprehensive
notion of what functions are implemented
by machines that we understand.
</p>
<p>A <b>protein</b> is a string of amino acids
(i.e., a string in the 20-character alphabet
{A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}).
</p>
<p>A <b>genome</b> is a string of DNA bases (i.e., a
string in the 4-character alphabet {A,C,G,T}).
</p>
<p>A <b>gene</b> is a region in the genome that
describes how to build a
protein. The description is a sequence of 3-character codons. Each
<span style="font-style: italic;">codon</span> may
be thought of as an
instruction specifying which amino acid should come next in the protein
the gene describes. &nbsp; Thus, if the protein described by the
gene
contains 100 amino acids, then the gene would be composed of 100 codons
(i.e., 300 DNA characters) followed by a codon that means "stop here"
(a <span style="font-style: italic;">stop codon</span>).&nbsp;
There are three stop codons: {TAA,TAG,TGA}. The genetic code is the
table of correspondences between codons and amino acids:
<br>
<br>
<table border="1">
<tbody>
<tr>
<th>Amino Acid</th>
<th>Codons</th>
</tr>
<tr>
<td>A</td>
<td>GCT, GCC, GCA, GCG </td>
</tr>
<tr>
<td>C</td>
<td>TGT, TGC</td>
</tr>
<tr>
<td>D</td>
<td>GAT, GAC</td>
</tr>
<tr>
<td>E</td>
<td>GAA, GAG</td>
</tr>
<tr>
<td>F</td>
<td>TTT, TTC</td>
</tr>
<tr>
<td>G</td>
<td>GGT, GGC, GGA, GGG</td>
</tr>
<tr>
<td>H</td>
<td>CAT, CAC</td>
</tr>
<tr>
<td>I</td>
<td>ATT, ATC, ATA</td>
</tr>
<tr>
<td>K</td>
<td>AAA, AAG</td>
</tr>
<tr>
<td>L</td>
<td>TTA, TTG, CTT, CTC, CTA, CTG</td>
</tr>
<tr>
<td>M</td>
<td>ATG</td>
</tr>
<tr>
<td>N</td>
<td>AAT, AAC</td>
</tr>
<tr>
<td>P</td>
<td>CCT, CCC, CCA, CCG</td>
</tr>
<tr>
<td>Q</td>
<td>CAA, CAG</td>
</tr>
<tr>
<td>R</td>
<td>CGT, CGC, CGA, CGG, AGA, AGG</td>
</tr>
<tr>
<td>S</td>
<td>TCT, TCC, TCA, TCG, AGT, AGC</td>
</tr>
<tr>
<td>T</td>
<td>ACT, ACC, ACA, ACG</td>
</tr>
<tr>
<td>V</td>
<td>GTT, GTC, GTA, GTG</td>
</tr>
<tr>
<td>W</td>
<td>TGG</td>
</tr>
<tr>
<td>Y</td>
<td>TAT, TAC</td>
</tr>
<tr>
<td>*</td>
<td>TAG, TGA, TAA [Stop codons]</td>
</tr>
</tbody>
</table>
<br>
<br>
</p>
<hr>The process of building a protein as a string of amino acids
from the gene containing codons is
called <b>expressing</b> the gene.
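<p>As a minimal sketch of what expressing a gene means under this abstraction, the following Python fragment inverts the table above and reads a gene one codon at a time. The function name <code>express</code> is hypothetical, and the trailing <code>TAA</code> in the example is appended for illustration to a gene start that appears later in this document.</p>

```python
# Genetic code copied from the table above: amino acid -> codons.
GENETIC_CODE = {
    "A": "GCT GCC GCA GCG", "C": "TGT TGC", "D": "GAT GAC", "E": "GAA GAG",
    "F": "TTT TTC", "G": "GGT GGC GGA GGG", "H": "CAT CAC", "I": "ATT ATC ATA",
    "K": "AAA AAG", "L": "TTA TTG CTT CTC CTA CTG", "M": "ATG", "N": "AAT AAC",
    "P": "CCT CCC CCA CCG", "Q": "CAA CAG", "R": "CGT CGC CGA CGG AGA AGG",
    "S": "TCT TCC TCA TCG AGT AGC", "T": "ACT ACC ACA ACG",
    "V": "GTT GTC GTA GTG", "W": "TGG", "Y": "TAT TAC", "*": "TAG TGA TAA",
}

# Invert the table: codon -> amino acid ("*" marks a stop codon).
CODON_TO_AA = {
    codon: aa for aa, codons in GENETIC_CODE.items() for codon in codons.split()
}


def express(gene):
    """Translate a gene codon by codon, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(gene) - 2, 3):
        aa = CODON_TO_AA[gene[pos:pos + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(express("ATGAAACTCTACAATCTGAAAGATCACAACTAA"))  # MKLYNLKDHN
```

<p>Ten codons followed by a stop codon yield a protein of ten amino acids, matching the 3-to-1 relationship described above.</p>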
<br>
<br>
<h2>Problems in Bioinformatics that Depend only on the Basic
Abstraction</h2>
<h4>Identifying Genes within the Genome</h4>
If we plan on using a genome, it will usually be necessary to identify
the genes within the genome. &nbsp;How can this best be done?
&nbsp; First, it should be noted that this can be broken into three
variations:<br>
<br>
<ol>
<li>Given no assumption of an existing body of previously
identified genes, find the genes in a new genome.</li>
<li>Given a large collection of existing genomes in which the
genes have been identified, find the set of genes in a new genome.</li>
<li>Given a large set of existing genomes, discard any existing
decisions and try to identify genes in all of them from scratch.</li>
</ol>

When the first genome was sequenced, the first option was pretty much
the only reasonable choice (this is not completely true, since we had
many partial genomes that had already been sequenced and annotated).
People focused on developing reasonable strategies that would make the
best possible choices taking just the single genome as input.
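<p>One very naive single-genome strategy, shown here only to make the first option concrete, is to scan for stretches that begin with <code>ATG</code> and run in the same reading frame to a stop codon. This is an illustrative sketch, not the method of any particular gene caller; the function name <code>find_orfs</code>, the <code>min_codons</code> threshold, and the test string are all hypothetical, and only one strand is examined.</p>

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) slices of ATG...stop stretches on the given strand.

    Each of the three reading frames is scanned independently. `min_codons`
    is the minimum number of codons before the stop codon (an assumption).
    """
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                      # open a candidate gene
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None                     # close it at the stop codon
    return orfs


dna = "CCATGAAACTCTAGGG"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])  # 2 14 ATGAAACTCTAG
```

<p>Real genomes contain many such stretches that are not genes, which is why the strategies discussed in this section bring in additional evidence.</p>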
<p>
Very quickly, the second alternative became more appropriate; it was
based on the idea of effectively exploiting the efforts that had been
expended in the early genomes to more quickly and accurately identify
the genes in each new genome.
<p>
It is worth noting that the second approach, while exploiting the
investments made in annotating the early genomes, also has the
property that early errors are frequently propagated.  If an
algorithm had called a section of an early genome a gene when it actually
was not, then when we see something similar in a new genome it might
well get improperly labeled as well.
<p>
The third approach offers an unusual perspective and opportunity.  It
suggests that we are entering an era in which we have many available
genomes, and that there might be approaches based on comparison that
would support more accurate annotations for the entire collection. 
There may be many such approaches, but we will describe just one that
is based on ideas used in creating one of the early gene-calling
systems.  Let us start by quoting the abstract from 
<b>CRITICA: coding region identification tool invoking comparative
analysis.</b> by Jonathan Badger and Gary Olsen (Mol Biol Evol. 1999
Apr;16(4):512-24. PMID: 10331277):

<blockquote>
"Gene recognition is essential to understanding existing and future
DNA sequence data. CRITICA (Coding Region Identification Tool
Invoking Comparative Analysis) is a suite of programs for identifying likely protein coding sequences in DNA by combining comparative analysis of DNA sequences with more common noncomparative methods. In the comparative component of the analysis,
regions of DNA are aligned with related sequences from the DNA
databases; if the translation of the aligned sequences has greater
amino acid identity than expected for the observed percentage nucleotide identity, this is interpreted as evidence for coding. CRITICA also incorporates noncomparative information derived from
the relative frequencies of hexanucleotides in coding-frames versus
other contexts (i.e., dicodon bias). The dicodon usage information
is derived by iterative analysis of the data so that CRITICA is not
dependent upon the existence or accuracy of coding sequence annotations in the databases. This independence makes the method
particularly well-suited for the analysis of novel genomes. CRITICA was tested by analyzing the available Salmonella typhimurium
DNA sequences. Its predictions were compared to the DNA sequence annotations and to the predictions of GenMark. CRITICA
proved more accurate than GenMark, and, moreover, many of its
predictions that would seem to be errors, instead reflect problems
in the sequence databases."
</blockquote>

To understand the basic idea, we need to discuss how genomes are
passed on to descendants.  We discuss the notion of replication below,
but for now let us just say that cells occasionally copy their genome
and divide into two cells, leaving a version of the genome in each
cell.  The set of machines in the original cell also gets divided.
How the cell makes sure that each of the new cells gets enough
machines to make up an operational life-form is a separate topic.  For
now, let us just say that they do achieve it.  Very occasionally, the
genome copied into a new cell differs from the original version because
of errors in copying.  These differences are called <i>mutations</i>.  If a
mutation occurred in a gene (encoding a protein), and if the mutation
caused the encoding to be changed to produce a protein sequence that
would not work, then the mutation is <i>lethal</i> and the cell dies
(whatever that means -- something close to "it does not function well
enough to compete for resources").  On the other hand, a mutation may change
the encoding and yield a protein that is just as good, or even
better.  Many mutations change the DNA but not the
protein it is used to generate (e.g., a change from <b>GGC</b> to <b>GGA</b>, both
of which encode the amino acid <b>G</b>).
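<p>To see how often a single-character change is silent, the sketch below enumerates the nine one-base variants of a codon and keeps those that encode the same amino acid. The 64-letter string is a compact layout of the same genetic code as the table in Part 1 (bases ordered T, C, A, G); the function names are hypothetical.</p>

```python
BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order ("*" = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def single_base_changes(codon):
    """Yield the 9 codons that differ from `codon` at exactly one position."""
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                yield codon[:pos] + base + codon[pos + 1:]


def silent_changes(codon):
    """Single-base changes that leave the encoded amino acid unchanged."""
    return sorted(
        variant for variant in single_base_changes(codon)
        if CODON_TO_AA[variant] == CODON_TO_AA[codon]
    )


print(silent_changes("GGC"))  # ['GGA', 'GGG', 'GGT']
print(silent_changes("ATG"))  # []
```

<p>For <b>GGC</b>, every change to the third character is silent, while any change to <b>ATG</b> alters the encoded amino acid.</p>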
<p>
Most mutations that occur in protein-encoding genes are lethal (the
proteins have been optimized over many, many generations).  The number
that improve the functioning of the encoded protein is relatively
small.  This means that most mutations that alter which amino acid is
encoded do not appear in the sequenced genomes (cells with those
mutations often just die).  A disproportionate number of the mutations we observe
fall into the category that leaves the encoded sequence of amino acids
unchanged.   
<p>
Let us make this more concrete so that you can tie
these notions together.  We begin with a <b>multiple-sequence
alignment</b> of the starts of some genes from closely-related cells:
<pre>

fig|198214.1.peg.4        ATGAAACTCTACAATCTGAAAGATCACAACGAGCAGGTCAGCTTTGCGCAAGCCGTAACC
fig|83333.1.peg.4         ATGAAACTCTACAATCTGAAAGATCACAACGAGCAGGTCAGCTTTGCGCAAGCCGTAACC
fig|331112.3.peg.3        ATGAAACTCTACAATCTGAAAGATCACAACGAGCAGGTCAGCTTTGCGCAAGCCGTAACC
fig|155864.1.peg.4        ATGAAACTCTACAATCTTAAAGATCACAATGAGCAGGTCAGCTTTGCGCAAGCCGTAACC
fig|321314.4.peg.144      ATGAAACTCTATAATCTGAAAGACCATAATGAGCAGGTCAGCTTTGCGCAGGCCGTCACG
                          *********** ***** ***** ** ** ******************** ***** ** 

fig|198214.1.peg.4        CAGGGGTTGGGCAAAAATCAGGGGCTGTTTTTTCCGCACGACCTGCCGGAATTCAGCCTG
fig|83333.1.peg.4         CAGGGGTTGGGCAAAAATCAGGGGCTGTTTTTTCCGCACGACCTGCCGGAATTCAGCCTG
fig|331112.3.peg.3        CAGGGGTTGGGCAAAAATCAGGGGCTGTTTTTTCCGCATGACCTGCCGGAATTCAGCCTG
fig|155864.1.peg.4        CAGGGGTTGGGCAAAAATCAGGGGCTGTTTTTTCCGCACGACCTGCCGGAATTCAGCCTG
fig|321314.4.peg.144      CAAGGACTGGGCAAACAGCAGGGACTTTTTTTTCCGCACGAACTGCCGGAGTTTAGCCTG
                          ** **  ******** * ***** ** *********** ** ******** ** ******
</pre>

We are depicting the initial 120 characters of the DNA encoding the
corresponding protein from 5 distinct cells.  We have assigned
distinct identifiers to the 5 genes (e.g., fig|198214.1.peg.4).  Each
of the genes begins with <b>ATG</b>, which is the codon encoding
<b>M</b>.  The corresponding amino acid strings (that is, the starts
of the proteins encoded by the genes) are as follows:


<pre>
fig|198214.1.peg.4        MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSL
fig|331112.3.peg.3        MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSL
fig|83333.1.peg.4         MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSL
fig|155864.1.peg.4        MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSL
fig|321314.4.peg.144      MKLYNLKDHNEQVSFAQAVTQGLGKQQGLFFPHELPEFSL
                          ************************* ******* ******
</pre>

Note that we have 120 DNA characters encoding 40 amino acids in each
of 5 closely-related genomes.  Note that the fourth codon in the gene
(TAT in one genome, but TAC in the others) corresponds to the <b>Y</b>
in the fourth position of the amino acid alignment.
We highly recommend that you manually
go through the correspondence between the DNA and amino acid
sequences.  Tabulate the number of mutations that did not alter the
amino acid sequences, as well as the number that did.  Think about
what this means.  It is critical.
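<p>Once you have done the tally by hand, a short script can check it. The sketch below compares two of the genes shown above, codon by codon, and counts how many differing codons leave the amino acid unchanged and how many alter it. The function name is hypothetical, and the 64-letter string is a compact layout of the genetic code from Part 1.</p>

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# First 120 characters of two genes, copied from the alignment above.
GENE_198214 = (
    "ATGAAACTCTACAATCTGAAAGATCACAACGAGCAGGTCAGCTTTGCGCAAGCCGTAACC"
    "CAGGGGTTGGGCAAAAATCAGGGGCTGTTTTTTCCGCACGACCTGCCGGAATTCAGCCTG"
)
GENE_321314 = (
    "ATGAAACTCTATAATCTGAAAGACCATAATGAGCAGGTCAGCTTTGCGCAGGCCGTCACG"
    "CAAGGACTGGGCAAACAGCAGGGACTTTTTTTTCCGCACGAACTGCCGGAGTTTAGCCTG"
)


def classify_codon_changes(gene_a, gene_b):
    """Count differing codons as (silent, altering) for two aligned genes."""
    silent = altering = 0
    for pos in range(0, len(gene_a) - 2, 3):
        codon_a, codon_b = gene_a[pos:pos + 3], gene_b[pos:pos + 3]
        if codon_a == codon_b:
            continue
        if CODON_TO_AA[codon_a] == CODON_TO_AA[codon_b]:
            silent += 1
        else:
            altering += 1
    return silent, altering


silent, altering = classify_codon_changes(GENE_198214, GENE_321314)
print("silent codon changes:  ", silent)
print("altering codon changes:", altering)
```

<p>The count is per codon, so a codon with two changed characters is counted once.</p>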
<p>
What is important for you to realize is that the authors of CRITICA
had a pretty good idea: with just these five genomes you can rather
reliably recognize that these regions encode amino acid strings.  If
we were to take the 30 characters ahead of the genes (usually called
<i>upstream</i> of the genes) along with the initial ATG we would get
the following alignment of those DNA sequences:

<pre>

fig|198214.1.peg.4        ACGGCGGGCGCACGAGTACTGGAAAACTAAATG
fig|331112.3.peg.3        ACGGCGGGCGCACGAGTACTGGAAAACTAAATG
fig|83333.1.peg.4         ACGGCGGGCGCACGAGTACTGGAAAACTAAATG
fig|155864.1.peg.4        ACGGCGGGCGCACGAGTACTGGAAAACTAAATG
fig|321314.4.peg.144      ACGGCGGGCGCACGAGTAGTGGGATAATCAATG
                          ****************** *** * * * ****
</pre>

When we look at the generated amino acids, we see

<pre>

fig|198214.1.peg.4        TAGARVLENXM
fig|331112.3.peg.3        TAGARVLENXM
fig|83333.1.peg.4         TAGARVLENXM
fig|155864.1.peg.4        TAGARVLENXM
fig|321314.4.peg.144      TAGARVVGXSM
                          ******:   *
</pre>

Here we see some <b>X</b>s in the alignment; they represent <i>stop
codons</i> (i.e., they indicate that the codon does not encode an
amino acid).  What is worth noting is that there are mutations in 5 of
the 30 upstream characters; they fall in 4 codons, and every one of those
4 codons translates differently.  It is a fact that most genes begin with
<b>ATG</b>, which makes it quite likely that this gene begins with the
exact <b>ATG</b> we have shown.
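<p>The contrast between the coding region and the upstream region can be put in numbers. The sketch below compares character identity with identity of the translations for two of the sequences shown above. It illustrates only the intuition behind the comparative test quoted from the CRITICA abstract, not the statistic CRITICA actually computes; the helper names are hypothetical, and stop codons are written as <code>*</code> rather than <b>X</b>.</p>

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# First 60 characters of the gene in two genomes (from the alignment above).
GENE_A = "ATGAAACTCTACAATCTGAAAGATCACAACGAGCAGGTCAGCTTTGCGCAAGCCGTAACC"
GENE_E = "ATGAAACTCTATAATCTGAAAGACCATAATGAGCAGGTCAGCTTTGCGCAGGCCGTCACG"
# The 30 characters upstream of the same two genes.
UP_A = "ACGGCGGGCGCACGAGTACTGGAAAACTAA"
UP_E = "ACGGCGGGCGCACGAGTAGTGGGATAATCA"


def translate(dna):
    """Translate every codon in frame 0; stop codons become '*'."""
    return "".join(CODON_TO_AA[dna[p:p + 3]] for p in range(0, len(dna) - 2, 3))


def identity(a, b):
    """Return (matching positions, total positions) for equal-length strings."""
    return sum(x == y for x, y in zip(a, b)), len(a)


for label, a, b in (("gene", GENE_A, GENE_E), ("upstream", UP_A, UP_E)):
    dna_same, dna_total = identity(a, b)
    aa_same, aa_total = identity(translate(a), translate(b))
    print(label, "DNA", dna_same, "/", dna_total, "protein", aa_same, "/", aa_total)
# gene DNA 53 / 60 protein 20 / 20
# upstream DNA 25 / 30 protein 6 / 10
```

<p>The two regions have similar DNA identity, but only the region inside the gene keeps its translation intact.</p>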
<p>
Now let us return to the topic of gene-calling.  Our basic approach
will be as follows:
<ol>
<li>Begin by attempting to find as many genes as we can by taking the
existing set of genomes and finding protein-encoding sections using
the idea that was used in CRITICA.  This will be computationally
expensive because it might require looking for similar regions in
thousands of genomes (remember there are 499,500 pairwise comparisons
to make for 1000 genomes, and there are thousands of similar genes for
almost all of the pairwise comparisons).  However, what we will get
out is a pretty accurate estimate of which areas of each genome are
actually genes.
<li>A second step, after forming as many accurate predictions as we
can make, would involve polishing things up by taking the set of
predicted genes and trying to
<ul>
<li> make sure the details are consistent (e.g., that the start
positions of corresponding genes seem to be the same) and
<li> that we have not missed any distantly related, short genes.
</ul>
</ol>
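<p>The count of pairwise comparisons quoted in the first step is the number of unordered pairs of genomes, n(n-1)/2. A one-line check (the function name is hypothetical):</p>

```python
import math


def pairwise_comparisons(n_genomes):
    """Number of unordered pairs among n_genomes, i.e. n * (n - 1) / 2."""
    return math.comb(n_genomes, 2)


print(pairwise_comparisons(1000))  # 499500
```

<p>The quadratic growth is what makes a comprehensive recomputation expensive as the collection grows.</p>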
A comprehensive recalling of genes should be done periodically,
leading to ever more reliable estimates for ever more thousands of
genomes.  The whole topic of reducing the effort required to do the
incremental comparisons between genomes is obviously going to be
considered over the coming years.  What is important, we suppose, is
that it is clear at this point that we can now accurately call genes
in prokaryotic genomes (although no one has yet gone back and cleaned
up all of the errors in the existing genomes).


<h4>Identifying Similar Genes</h4>
Genes are said to be <span style="font-style: italic;">homologous</span>
if they share a common ancestor. &nbsp;Tools have been developed to
construct estimates of whether or not two genes, or the protein
sequences they encode, are homologous. &nbsp;Most of these are
based on measuring the degree of <span style="font-style: italic;">similarity</span>
between the genes based on some metric. &nbsp;The most basic
versions of this problem are<br>
<br>
<ol>
<li>Given two genes (or proteins), are they homologs?
&nbsp;That is, estimate the likelihood that they are homologs.</li>
<li>Given a gene and a database of other genes, extract a
prioritized list from the database of genes that are likely to be
homologs. &nbsp;Similarly, given a protein sequence and a database
of other protein sequences, which are most likely to be produced by
homologous genes?</li>
<li>Produce an <span style="font-style: italic;">alignment</span>
of two DNA or protein sequences that attempts to show corresponding
characters in the two sequences. &nbsp; For example,<br>
</li>
</ol>
<pre>
fig|226900.1.peg.4136      ------------------ATGAGTAAAATTATCGGTATTGACTTAGGTAC
fig|138677.1.peg.499       ATGAGTGAACACAAAAAATCAAGCAAAATTATAGGTATAGACTTAGGCAC
                                                ** ******** ***** ******** **

fig|226900.1.peg.4136      AACAAACTCTTGTGTAGCTGTTATGGAAGGTGGAGAACCAAAGGTTATCC
fig|138677.1.peg.499       AACAAACTCCTGCGTATCTGTTATGGAAGGAGGACAAGCTAAAGTAATTA
                           ********* ** *** ************* *** ** * ** ** **  

fig|226900.1.peg.4136      CAAATCCAGAAGGGAACCGTACAACACCTTCTGTTGTAGCTTTCAAAAAT
fig|138677.1.peg.499       CATCATCCGAAGGAACAAGAACCACGCCATCGATCGTTGCCTTCAAAGGT
                           **    * ***** *   * ** ** ** **  * ** ** ******  *

fig|226900.1.peg.4136      GAAGAACGTCAAGTTGGGGAAGTTGCAAAGCGCCAAGCAATTACAAACCC
fig|138677.1.peg.499       AATGAGAAATTAGTGGGGATTCCAGCAAAACGTCAAGCAGTGACAAATCC
                            * **      *** ***      ***** ** ****** * ***** **

fig|226900.1.peg.4136      AAATACAA---TCATGTCTGTTAAACGTCATATGGG---TACAGACTACA
fig|138677.1.peg.499       AGAAAAAACTCTCGGCTCTACAAAACGCTTTATTGGCCGTAAGTACTCTG
                           * * * **   **   ***   *****   *** **   **   ***   

fig|226900.1.peg.4136      AAGTAG--------------------------------------------
fig|138677.1.peg.499       AAGTAGCTTCGGAAATCCAAACCGTTCCTTATACAGTCACCTCCGGATCT
                           ******                                            

fig|226900.1.peg.4136      -------------------AAGTTGAAGGTAAAGATTATACACCTCAAGA
fig|138677.1.peg.499       AAAGGTGATGCCGTTTTCGAAGTTGATGGCAAACAATACACTCCAGAAGA
                                              ******* ** *** * ** ** **  ****

fig|226900.1.peg.4136      AATTTCTGCCATCATTTTACAAAACTTAAAAGCTTCTGCTGAAGCATACT
fig|138677.1.peg.499       AATTGGCGCACAAATCTTAATGAAAATGAAAGAGACAGCAGAAGCTTATC
                           ****   **    ** ***   **  * ****   * ** ***** **  

fig|226900.1.peg.4136      TAGGTGAAACAGTAACGAAAGCTGTTATTACAGTACCTGCATACTTCAAC
fig|138677.1.peg.499       TAGGCGAAACTGTCACAGAAGCAGTGATCACCGTCCCCGCATACTTCAAT
                           **** ***** ** **  **** ** ** ** ** ** *********** 

fig|226900.1.peg.4136      GATGCAGAGCGTCAAGCAACGAAAGATGCTGGTCGTATCGCTGGTTTAGA
fig|138677.1.peg.499       GATTCTCAACGAGCATCCACAAAAGATGCTGGACGCATTGCAGGTCTAGA
                           *** *  * **   * * ** *********** ** ** ** *** ****

fig|226900.1.peg.4136      AGTTGAGCGTATCATTAACGAGCCAACAGCAGCAGCACTTGCTTACGGTT
fig|138677.1.peg.499       TGTAAAACGTATCATTCCAGAACCTACCGCAGCAGCTCTTGCCTACGGAA
                            **  * *********   ** ** ** ******** ***** *****  

fig|226900.1.peg.4136      TAGAAAAACAAGACGAAGAACAAAAAATCTTAGTATATGACTTAGGTGGC
fig|138677.1.peg.499       TCGATAA---AGTCGGTGATAAAAAAATCGCTGTCTTCGACCTTGGTGGA
                           * ** **   ** **  **  ********   ** *  *** * ***** 
</pre>

When two characters are in the same column, the implication is that we
believe that they derived from the same character in an ancestral
sequence.  When a dash (i.e., a <b>-</b>) appears in a column, it indicates that we
believe that
<ul>
<li>the ancestral sequence had a character which corresponds to a
character in one of the sequences, but the other sequence lost a
character in the evolutionary process, or
<li>
the ancestral sequence did not have a character in this position, but
a new one was inserted for one of the two sequences.
</ul>
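<p>One standard way to compute such a two-sequence alignment is Needleman-Wunsch dynamic programming, sketched below. This is not necessarily how the alignment above was produced, and the scoring values (match 1, mismatch -1, gap -1) are arbitrary assumptions chosen for illustration.</p>

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment; returns (row_a, row_b, score)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: each cell is the best of diagonal, up, and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Trace back from the bottom-right corner, inserting dashes for gaps.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]


row_a, row_b, total = global_align("ACGT", "ACT")
print(row_a)  # ACGT
print(row_b)  # AC-T
print(total)  # 2
```

<p>The dash in the second row says that, under this scoring, the best explanation is that one sequence has a character the other lacks, which is exactly the insertion-or-deletion ambiguity described above.</p>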

<h4>Multiple-Sequence Alignment</h4>

A multiple-sequence alignment extends the notion of a binary
alignment.  We have already used them in discussing the problem of
identifying the genes in genomes, but they represent a fundamental
source of comparative insight and come into play in almost every
aspect of analyzing genomic sequences.

Consider the following piece of a multiple-sequence alignment:

<pre>
fig|226900.1.peg.4136      -------------------MSKIIGIDLGTTNSCVAVME-GGEPKVIPNP
fig|95665.5.peg.505        ----------------------------------MAVIE-NKKPIVLENP
fig|138677.1.peg.499       -------------MSEHKKSSKIIGIDLGTTNSCVSVME-GGQAKVITSS
fig|243274.1.peg.368       ---------------MAEKKEFVVGIDLGTTNSVIAWMKPDGTVEVIPNA
fig|349521.5.peg.4864      MIRKIAVFSFLRANRGFQSSMSLIGIDLGTTNSLIAHWG-EQGVEIIPNR
fig|397945.5.peg.3653      -----------------MEQKMIIGIDLGTTNSLVAAWK-DGRSVLIPNA
                                                             ::         :: . 

fig|226900.1.peg.4136      EGNRTTPSVVAFK-NEERQVGEVAKRQAITNPN-TIMSVKRHMG------
fig|95665.5.peg.505        EGKRTVPSVVSFN-GDEVLVGDAAKRKQITNPN-TVSSIKRLMG------
fig|138677.1.peg.499       EGTRTTPSIVAFK-GNEKLVGIPAKRQAVTNPEKTLGSTKRFIGRKYSEV
fig|243274.1.peg.368       EGSRVTPSVVAFTKSGEILVGEPAKRQMILNPERTIKSIKRKMG------
fig|349521.5.peg.4864      LGARLTPSAVSLDADGAVIVGQAAKDRLVTHPDLSVASFKRRMG------
fig|397945.5.peg.3653      LGETLTPSCVSLDEDVTVLVGRAARERLQTHPDRTAANFKRYMG------
                            *   .** *::  .    **  *: :   :*: :  . ** :*      

fig|226900.1.peg.4136      ----------------TDYKVEVEGKDYTPQEISAIILQNLKASAEAYLG
fig|95665.5.peg.505        ----------------TKEKVTILNKEYTPEEISAKILSYIKDYAEKKLG
fig|138677.1.peg.499       ASEIQTVPYTVTSGSKGDAVFEVDGKQYTPEEIGAQILMKMKETAEAYLG
fig|243274.1.peg.368       ----------------TDYKVRIDDKEYTPQEISAFILKKLKNDAEAYLG
fig|349521.5.peg.4864      ----------------TNAAYTLGKQSFRPEELSALVLKQLKEDAEAYLN
fig|397945.5.peg.3653      ----------------SDRTVALAGRAFRPEELSSLVLRALKADAEAFLG
                                            .    :  : : *:*:.: :*  :*  **  *.

</pre>

In actuality, these six sequences are part of a set of sequences that
are fairly similar, and
recognizably so.  However, we believe that it is far from clear that
the alignment above is actually "correct" or "optimal" in a meaningful
sense.  Rather, it is probably close to correct but contains
errors.  Exactly where the dashes (called <i>indels</i>, since they
represent characters that were either inserted or deleted) should be
placed is uncertain.
<p>
There are two classes of problems associated with multiple-sequence
alignments: 
<ol>
<li>how to compute them and 
<li>how to use them.
</ol>
Some of the most interesting problems are of the second sort -- using
multiple-sequence alignments in what might be called <i>molecular
archaeology</i> to uncover events in the evolutionary history of the
sequences that occur in the alignment.  

On the other hand, as we accumulate collections of thousands of
homologous sequences, one of the more important problems in
bioinformatics is the development of tools to support the
construction and use of these
multiple-sequence alignments.
<p>
Before we leave this topic, we will briefly describe a tool that we
believe any computer scientist could build easily and that would
reveal numerous research topics.  Suppose that we have a single genome
that we wish to analyze, and that we have computed all regions of
similarity between sections of this genome and other complete genomes.
For each character in the genome we are focused on, we can easily
extract all regions in other genomes that are similar to regions in
the focus genome that contain the given character.  Further, each of
the stored similarities (between a region in the given genome and one
of the other genomes) has an associated <i>percent identity</i> (a
measure of how similar the regions are - the percent of the aligned
characters that are identical).  The utility that is needed would let
the user specify a region in the given genome, along with a
range of desired similarities, and would then display the
alignment composed of the similarities in the selected range (maybe with some
representation of the consensus and how conserved the values are).
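<p>A minimal sketch of the selection step of such a tool follows. The <code>Similarity</code> record, its field names, and the example data are hypothetical, introduced only for illustration; the sketch also assumes that columns containing a dash are excluded when computing percent identity.</p>

```python
from dataclasses import dataclass


@dataclass
class Similarity:
    start: int          # first position of the region in the focus genome
    end: int            # last position of the region (inclusive)
    other_genome: str   # identifier of the genome that was matched
    aligned_focus: str  # focus-genome row of the pairwise alignment
    aligned_other: str  # other-genome row of the pairwise alignment


def percent_identity(row_a, row_b):
    """Percent of aligned (dash-free) columns holding identical characters."""
    columns = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    return 100.0 * sum(x == y for x, y in columns) / len(columns)


def select(similarities, region_start, region_end, min_identity, max_identity):
    """Similarities overlapping the region with identity in the given range."""
    chosen = []
    for sim in similarities:
        overlaps = sim.end >= region_start and region_end >= sim.start
        pid = percent_identity(sim.aligned_focus, sim.aligned_other)
        if overlaps and pid >= min_identity and max_identity >= pid:
            chosen.append((sim.other_genome, pid))
    return chosen


stored = [
    Similarity(10, 15, "genomeB", "ACGTAC", "ACGTAC"),
    Similarity(12, 17, "genomeC", "ACG-AC", "ACCTAC"),
    Similarity(100, 105, "genomeD", "ACGTAC", "ACGTAC"),
]
print(select(stored, 11, 20, 70, 90))  # [('genomeC', 80.0)]
```

<p>Displaying the selected regions as a single alignment, with a consensus line, would be the next step and is left open here.</p>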
<br>

<h3> Given a multiple sequence alignment, determine the most
likely evolutionary history of the sequences (i.e., construct a
phylogenetic tree).</h3>

Here is an example of a multiple sequence alignment:
<br>
<br>
<pre>
seq3 -------------------MRYISTRGQAPALNFEDVLLAGLASDGGLYVPENLPRFTLE
seq4 -------------------MRYISTRGSAPTLSFEEVLLTGLASDGGLYVPESLPSFTSA
seq5 -------------------MNYISTRGAIAPIGFKDAVMMGLATDGGLLLPETIPALGRN
seq1 -------------------MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLT
seq2 MKIRVICGAPTPKPFIKIPMKYYSTNKQAPLASLEEAVVKGLASDKGLFMPMTIKPLPQE

seq3 EIASWVGLPYHELAFRVMRPFVAGSIADADFKKILEETYGVFAHDAVAPLRQLNGNEWVL
seq4 ELEAMASLDYPSLAHRILLPFVEEAFTGEELREIIDDTYAVFRHSAVAPLVQLDHNQWVL
seq5 TLESWQSLSYQDLAFNVIS-LFADDIPAQDLKDLIDRSYATFSHPEITPVVEKDG-VYIL
seq1 EIDEMLKLDFVTRSAKILSAFIGDEIPQEILEERVRAAFAFP-----APVANVESDVGCL
seq2 FYDEIENLSFREIAYRVADAFFGEDVPAETLKEIVYDTLNFD-----VPLVPVKENIYSL

seq3 ELFHGPTLAFKDFALQLLGRLLDHVLAKRGER-VVIMGATSGDTGSAAIEGCRRCDNVDI
seq4 ELFHGPTLAFKDFALQLLGRLLDAILKRRGEK-VVIMGATSGDTGSAAIAGCERCENIDI
seq5 ELFHGPTLAFKDVALQLLGNLFEYLLKERGEK-MNIVGATSGDTGSAAIYGVRGKDKINI
seq1 ELFHGPTLAFKDFGGRFMAQMLTHIA---GDKPVTILTATSGDTGAAVAHAFYGLPNVKV
seq2 ELFHGPTLAFKDVGGRFMARLLGYFIRKEGRKQVNVLVATSGDTGSAVANGFLGVEGIHV

seq3 FIMHPHNRVSEVQRRQMTTILGDNIHNIAIEGNFDDCQEMVKASFADQGFLK-GTRLVAV
seq4 FILHPHGRVSEVQRRQMTTLSAPTIHNLAIEGNFDDCQAMVKASFRDQSFLPDGRRLVAV
seq5 FILHPHGKTSPVQALQMTTVLDPNVHNIAARGTFDDCQNIVKSLFSDLPFKE-KYSLGAV
seq1 VILYPRGKISPLQEKLFCTLGG-NIETVAIDGDFDACQALVKQAFDDEELKV-ALGLNSA
seq2 YVLYPKGKVSEIQEKQFTTLGR-NITALEVDGTFDDCQALVKAAFMDQELNE-QLLLTSA

seq3 NSINWARIMAQIVYYFHAALQLG-APH-RSVAFSVPTGNFGDIFAGYLARNMGLPVSQLI
seq4 NSINWARIMAQIVYYFYAGLRLG-APH-RAAAYSVPTGNFGDIFAGYLASKMGLPVAQLM
seq5 NSINWARVLAQVVYYFYAYFRVA-ALFGQEVVFSVPTGNFGDIFAGYVAKRMGLPIRRLI
seq1 NSINISRLLAQICYYFEAVAQLPQETRNQ-LVVSVPSGNFGDLTAGLLAKSLGLPVKRFI
seq2 NSINVARFLPQAFYYFYAYAQLKKAGRAENVVICVPSGNFGNITAGLFGKKMGLPVRRFI

seq3 VATNRNDILHRFMSGNRYDKDTLHPSLSPSMDIMVSSNFERLLFDLHGRNGKAVAELLDA
seq4 IATNRNDVLHRLLSTGDYARQTLEHTLSPSMDISVSSNFERLMFDLYERDGAAIASLMAA
seq5 LATNENNILSRFINGGDYSLGDVVATVSPSMDIQLASNFERYVYYLFGENPARVREAFAA
seq1 AATNVNDTVPRFLHDGQWSPKATQATLSNAMDVSQPNNWPR-VEELFR------------
seq2 AANNKNDIFYQYLQTGQYNPRPSVATIANAMDVGDPSNFAR-VLDLYGGS----------

seq3 FKASGKLSVEDQRWTEARKLFDSLAVSDEQTCETIAEVYRSCGELLDPHTAIGVRAAREC
seq4 FDD-GDITLSDAAMEKARQLFASHRVDDAQTLACIADVWGRTEYLLDPHSAIGYAAATQP
seq5 LPTKGRIDFTEAEMEKVRDEFLSRSVNEDETIATIAAFHRETGYILDPHTAVGVKAALEL
seq1 -------------RKIWQLKELGYAAVDDETTQQTMRELKELGYTSEPHAAVAYRALRDQ
seq2 -------------HAAIAAEISGTTYTDEQIRESVKACWQQTGYLLDPHGACGYRALEEG

seq3 RRSLSVPMVTLGTAHPVKFPEAVEKAGIGQAPALPAHLADLFEREERCTVLPNELAKVQA
seq4 GANTQTPWVTLATAHPAKFPDAIKASAVGTTAQLPVHLADLFERSEHFDVLPNDIAAVQR
seq5 VQDG-TPAVCLATAHPAKFAEAVVR-AVGFEPSRPTSLEGIEALPSRCDVLDADRDAIKA
seq1 LNPG-EYGLFLGTAHPAKFKESVEA-ILGETLDLPKELAERADLPLLSHNLPADFAALRK
seq2 LQPG-ETGVFLETAHPAKFLQTVES-IIGTEVEIPAKLRAFMKGEKKSLPMTKEFADFKS

seq3 FVSQHGNRGKPL
seq4 FMSGHLGA----
seq5 FIEKKAL-----
seq1 LMMNHQ------
seq2 YLLGK-------
</pre>

From the five extant sequences displayed in the
previous alignment, we can construct
a tree that depicts the "phylogenetic history" of the sequences.
Here is one reasonable tree for these 5 sequences.
<pre>
  ,----------------------- seq5
  |
  |
 -|
  |
  |
  |                                         ,---------------------------- seq3
  |                                         |
  |                                         |
  |                       ,-----------------|
  |                       |                 |
  |                       |                 |
  |                       |                 `--------------------------- seq4
  |                       |
  |                       |
  |                    ,--|
  |                    |  |
  |                    |  |
  |                    |  `---------------------------------------------- seq1
  |                    |
  |                    |
  `--------------------|
                       |
                       |
                       `------------------------------------------------ seq2
</pre>
The tree suggests that at some point an ancestral
cell replicated. One copy led (through a chain of descendants) to <b>seq5</b>,
while the remaining sequences descend from the other copy.
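<p>To give a feel for how a tree can be derived from pairwise distances, here is a sketch of UPGMA, a simple average-linkage clustering method. It is not necessarily the method used to build the tree above, and the distances in the example are invented for illustration only.</p>

```python
from itertools import combinations


def upgma(names, distance):
    """Cluster names by average linkage; return the tree as nested tuples.

    `distance` maps frozenset({x, y}) to the distance between x and y.
    """
    sizes = {name: 1 for name in names}  # cluster -> number of leaves
    dist = dict(distance)
    while len(sizes) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: dist[frozenset(pair)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # Distance to the new cluster is the size-weighted average.
        for other in sizes:
            dist[frozenset((merged, other))] = (
                size_a * dist[frozenset((a, other))]
                + size_b * dist[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


# Hypothetical distances, invented for illustration only.
toy = {
    frozenset(("s1", "s2")): 0.2,
    frozenset(("s1", "s3")): 0.6,
    frozenset(("s2", "s3")): 0.6,
    frozenset(("s1", "s4")): 0.9,
    frozenset(("s2", "s4")): 0.9,
    frozenset(("s3", "s4")): 0.9,
}
print(upgma(["s1", "s2", "s3", "s4"], toy))  # ('s4', ('s3', ('s1', 's2')))
```

<p>In practice the distances would be estimated from the alignment columns, and more careful methods than UPGMA are commonly preferred; the sketch only shows the shape of the computation.</p>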
<p>Note that we now have alignments that
contain thousands of sequences, and even displaying such trees is
nontrivial.
Because evolution plays such a central role in the phenomena we study,
the construction of alignments
and trees in order to compare extant versions of proteins and gain
insight into their historical origins
is considered basic to the task at hand.
</p>
<h4>What is "the tree of life" and How Might it Get Built?</h4><br>The
problem of constructing a single phylogenetic tree from a single
alignment (the last problem) is relevant to this issue, but it does not
cover it. &nbsp;Suppose that you built 200 alignments &nbsp;that
contain the sequences common to almost all genomes. &nbsp;Then, if you
were to build 200 trees, and then you found that they were not
identical (or even close in some cases), what would you infer, and how
should you respond? &nbsp;Is it even possible or desirable that we
actually create an estimate of the history of how the existing
micro-organisms have evolved from some ancestral organism?<br><br>

<h4>Assuming that We Do Have an Estimate of the Tree of Life, which Proteins Characterize Subdivisions of the Tree?</h4>It
is clear that sequences are introduced into genomes through replication
and (in addition) through horizontal transfer. &nbsp;In the presence of
large amounts of horizontal transfer, many genes will occur only in
relatively small portions of a specific subtree (these represent
relatively recent transfers). &nbsp;Is it possible and meaningful to
create inventories of proteins that tend to be unique to a subtree (or
is the concept "tend to be unique" somewhat similar to "a little
pregnant")?<br>

<h4>Can We Identify Instances of Horizontal Transfer?</h4>How
can we construct tools to recognize horizontal transfer, and can these
tools be good enough to sort out the actual details of the evolutionary
history?<br><br>

<h4>Can We Determine Which Columns and Sections of a Multiple-Sequence Alignment are Conserved (and Why)?</h4>Conservation
normally implies functional constraints (the reason a column has
restricted content is that any evolutionary change &nbsp;led to the
death of the organism that had it). &nbsp;Shifts of function relate to
conserved sections that have changed (i.e., the sections are not
random, but neither are they identical). &nbsp;The correspondence
between conservation and function is a rich source of significant
problems.<br><br>
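<p>The simplest notion of a conserved column is one in which every sequence has the same character; this is what the <code>*</code> marks under the DNA alignments in this document indicate. A minimal sketch follows, using the upstream alignment from the gene-calling discussion. The function name is hypothetical, and the <code>:</code> and <code>.</code> marks seen under the protein alignments, which flag similar but non-identical residues, are not reproduced.</p>

```python
def conservation_line(rows):
    """Return '*' for columns where all rows agree, and ' ' otherwise."""
    return "".join(
        "*" if len(set(column)) == 1 else " " for column in zip(*rows)
    )


# The 30 upstream characters plus ATG for the five genes shown earlier.
upstream = [
    "ACGGCGGGCGCACGAGTACTGGAAAACTAAATG",
    "ACGGCGGGCGCACGAGTACTGGAAAACTAAATG",
    "ACGGCGGGCGCACGAGTACTGGAAAACTAAATG",
    "ACGGCGGGCGCACGAGTACTGGAAAACTAAATG",
    "ACGGCGGGCGCACGAGTAGTGGGATAATCAATG",
]
for row in upstream:
    print(row)
print(conservation_line(upstream))
```

<p>Deciding <i>why</i> a column is conserved requires far more than this, but a per-column summary of this kind is the usual starting point.</p>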

<h4>To What Extent Can Structure (Secondary or Tertiary) be Predicted from a Multiple-Sequence Alignment?</h4>Comparison
of columns in a large multiple sequence alignment was the key to
developing secondary structures for both DNA alignments and protein
alignments.<span style="font-family: monospace;"><br></span>
<h2>The Machines: an Initial Inventory</h2>

<h3>Energy Issues</h3>
The following diagram offers a summary of the machines that relate to
acquisition and storage of energy, as well as the production of a
number of key compounds by breaking up sugar:<br>
<br>
<br>
<img style="width: 621px; height: 612px;" alt="" src="energy.jpg"><br>
<br>
&nbsp;&nbsp; &nbsp;
<table style="text-align: left; width: 411px; height: 156px;" border="1" cellpadding="2" cellspacing="2">
<tbody>
<tr>
<td>M1</td>
<td>harvesting light energy</td>
</tr>
<tr>
<td>M2</td>
<td>building sugar from smaller components and energy</td>
</tr>
<tr>
<td>M3</td>
<td>Storing strings of sugar molecules as starch</td>
</tr>
<tr>
<td>M4</td>
<td>breaking up starch to give sugar</td>
</tr>
<tr>
<td>M5</td>
<td>breaking up sugar to get energy and smaller molecules</td>
</tr>
</tbody>
</table>
<br>
Many of our machines will need energy to run. &nbsp;In the basic
organism we are describing, we have included <span style="font-weight: bold;">M1</span> to harvest energy
from sunlight. &nbsp;This process is called <span style="font-style: italic;">photosynthesis</span>.
&nbsp;The cell stores energy in a molecule called <span style="font-weight: bold;">ATP</span>.
&nbsp;Whenever energy is needed, the molecule is broken into two
pieces, releasing energy. &nbsp;The cell maintains a fairly
constant concentration of ATP, which allows reactions throughout the
cell to depend on it. &nbsp;This is similar in many respects to the
way electricity is available throughout a house. &nbsp;Appliances
can be designed to plug in anywhere, and they assume the normal voltage
will be available. &nbsp;Similarly, we have a mechanism for
maintaining the concentration of ATP, and this allows us to include
reactions that depend on that concentration.<br>
<br>
<span style="font-weight: bold;">M2</span> is a
machine that builds sugar from CO2 and energy. &nbsp;This involves
a number of transformations. &nbsp;Eventually, we will need to
examine the individual steps, but for now let us remain at this quite
abstract level.<br>
<br>
Machines <span style="font-weight: bold;">M3</span>
and <span style="font-weight: bold;">M4</span> allow
the cell to store sugars when energy is abundant, and then to use them
later when energy is needed. &nbsp;Starch should be thought of as
just a string of sugar molecules, which is a convenient way to store
them. &nbsp;When sugar is needed, <span style="font-weight: bold;">M4</span> can be used to
break off a few.<br>
<br>
Finally, <span style="font-weight: bold;">M5</span>
is a machine that takes sugar molecules and breaks them into smaller
pieces, releasing energy (in the form of ATP) in the process.
&nbsp;These smaller molecules are the building blocks that are used
over and over to build things needed by the
cell. &nbsp;Here is a table that contains the abbreviations we use
for these molecules. &nbsp;Frankly, if you have not had
biochemistry classes, you might simply work with the abbreviations,
since the full names can be intimidating.<br>
<br>
<table style="text-align: left;" border="1" cellpadding="2" cellspacing="2">
<tbody>
<tr>
<td>2OG</td>
<td>2-oxoglutarate</td>
</tr>
<tr>
<td>3PG</td>
<td>3-phosphoglycerate</td>
</tr>
<tr>
<td>A</td>
<td>Adenine [one of the characters in a DNA string]</td>
</tr>
<tr>
<td>Ala</td>
<td>Alanine [an amino acid]</td>
</tr>
<tr>
<td>Arg</td>
<td>Arginine [an amino acid]</td>
</tr>
<tr>
<td>Asn</td>
<td>Asparagine [an amino acid]</td>
</tr>
<tr>
<td>Asp</td>
<td>Aspartate [an amino acid]</td>
</tr>
<tr>
<td>C</td>
<td>Cytosine [one of the characters in a DNA string]</td>
</tr>
<tr>
<td>CHOR</td>
<td>Chorismate</td>
</tr>
<tr>
<td>CO2</td>
<td>Carbon dioxide</td>
</tr>
<tr>
<td>Daughter genome</td>
<td>the copy of the genome that goes to the added cell after replication</td>
</tr>
<tr>
<td>E4P</td>
<td>Erythrose 4-phosphate</td>
</tr>
<tr>
<td>Extra Membrane</td>
<td>A little extra membrane for the new cell</td>
</tr>
<tr>
<td>G</td>
<td>Guanine [one of the characters in a DNA string]</td>
</tr>
<tr>
<td>G6P</td>
<td>Glucose 6-phosphate</td>
</tr>
<tr>
<td>Genome</td>
<td>the DNA string in the cell that contains the genes</td>
</tr>
<tr>
<td>Gln</td>
<td>Glutamine [an amino acid]</td>
</tr>
<tr>
<td>Glu</td>
<td>Glutamate [an amino acid]</td>
</tr>
<tr>
<td>Gly</td>
<td>Glycine [an amino acid]</td>
</tr>
<tr>
<td>HOM</td>
<td>Homoserine</td>
</tr>
<tr>
<td>His</td>
<td>Histidine [an amino acid]</td>
</tr>
<tr>
<td>Iso</td>
<td>Isoleucine [an amino acid]</td>
</tr>
<tr>
<td>Leu</td>
<td>Leucine [an amino acid]</td>
</tr>
<tr>
<td>Lys</td>
<td>Lysine [an amino acid]</td>
</tr>
<tr>
<td>Membrane</td>
<td>the thing enclosing the cell</td>
</tr>
<tr>
<td>Met</td>
<td>Methionine [an amino acid]</td>
</tr>
<tr>
<td>OXLA</td>
<td>Oxalacetate</td>
</tr>
<tr>
<td>PEP</td>
<td>Phosphoenolpyruvate</td>
</tr>
<tr>
<td>PYR</td>
<td>Pyruvate</td>
</tr>
<tr>
<td>Phe</td>
<td>Phenylalanine [an amino acid]</td>
</tr>
<tr>
<td>Pro</td>
<td>Proline [an amino acid]</td>
</tr>
<tr>
<td>R5P</td>
<td>Ribose 5-phosphate</td>
</tr>
<tr>
<td>Ser</td>
<td>Serine [an amino acid]</td>
</tr>
<tr>
<td>Starch</td>
<td>A polymer of sugars (used for storage)</td>
</tr>
<tr>
<td>Sugar</td>
<td>think glucose</td>
</tr>
<tr>
<td>T</td>
<td>Thymine [one of the characters in a DNA string]</td>
</tr>
<tr>
<td>Thr</td>
<td>Threonine [an amino acid]</td>
</tr>
<tr>
<td>Trp</td>
<td>Tryptophan [an amino acid]</td>
</tr>
<tr>
<td>Tyr</td>
<td>Tyrosine [an amino acid]</td>
</tr>
<tr>
<td>Val</td>
<td>Valine [an amino acid]</td>
</tr>
</tbody>
</table>
<br>
<br>
<h3>Building the Amino Acids</h3>
<img style="width: 576px; height: 529px;" alt="" src="AA1.jpg"><br>
<table style="text-align: left;" border="1" cellpadding="2" cellspacing="2">
<tbody>
<tr>
<td>M6</td>
<td>build glutamate and glutamine &nbsp;from
2-oxoglutarate</td>
</tr>
<tr>
<td>M7</td>
<td>build proline from glutamate and ATP</td>
</tr>
<tr>
<td>M8</td>
<td>build aspartate from oxalacetate</td>
</tr>
<tr>
<td>M9</td>
<td>build arginine from glutamate, aspartate, and ATP</td>
</tr>
<tr>
<td>M10</td>
<td>build asparagine from glutamine, aspartate, and ATP</td>
</tr>
<tr>
<td>M11</td>
<td>build serine from 3-phosphoglycerate and glutamate</td>
</tr>
</tbody>
</table>
<br>
<img style="width: 512px; height: 665px;" alt="" src="AA2.jpg"><br>
<br>
<table style="text-align: left;" border="1" cellpadding="2" cellspacing="2">
<tbody>
<tr>
<td align="undefined" valign="undefined">M12</td>
<td align="undefined" valign="undefined">build
glycine from serine</td>
</tr>
<tr>
<td align="undefined" valign="undefined">M13</td>
<td align="undefined" valign="undefined">build
cysteine from serine</td>
</tr>
<tr>
<td align="undefined" valign="undefined">M14</td>
<td align="undefined" valign="undefined">build
methionine from homoserine and cysteine</td>
</tr>
<tr>
<td align="undefined" valign="undefined">M15</td>
<td align="undefined" valign="undefined">build lysine from pyruvate and aspartate</td>
</tr>
<tr>
<td align="undefined" valign="undefined">M16</td>
<td align="undefined" valign="undefined">build
homoserine from aspartate</td>
</tr>
<tr>
<td align="undefined" valign="undefined">M17</td>
<td align="undefined" valign="undefined">build threonine from homoserine and ATP</td>
</tr>
<tr>
<td align="undefined" valign="undefined">M18</td>
<td align="undefined" valign="undefined">build isoleucine from glutamate, threonine and pyruvate</td>
</tr>
</tbody>
</table>
<br>
<img style="width: 563px; height: 651px;" alt="" src="AA3.jpg"><br>
<br>
<table style="text-align: left;" border="1" cellpadding="2" cellspacing="2">
<tbody>
<tr>
<td align="undefined" valign="undefined">M19</td>
<td align="undefined" valign="undefined">build alanine from pyruvate</td>
</tr>
<tr>
<td align="undefined" valign="undefined">M20</td>
<td align="undefined" valign="undefined">build valine from pyruvate</td>
</tr>
<tr>
<td align="undefined" valign="undefined">M21</td>
<td align="undefined" valign="undefined">build leucine from pyruvate</td>
</tr>
<tr>
<td align="undefined" valign="undefined">M22</td>
<td align="undefined" valign="undefined">build the intermediate &nbsp;chorismate from phosphoenolpyruvate and erythrose 4-phosphate</td>
</tr>
<tr>
<td align="undefined" valign="undefined">M23</td>
<td align="undefined" valign="undefined">build tyrosine and phenylalanine from glutamate and chorismate</td>
</tr>
<tr>
<td align="undefined" valign="undefined">M24</td>
<td align="undefined" valign="undefined">build tryptophan from chorismate and glutamine</td>
</tr>
<tr>
<td align="undefined" valign="undefined">M25</td>
<td align="undefined" valign="undefined">build ribose 5-phosphate from glucose 6-phosphate</td>
</tr>
<tr>
<td align="undefined" valign="undefined">M26</td>
<td align="undefined" valign="undefined">build histidine from ribose 5-phosphate and ATP</td>
</tr>
</tbody>
</table>
<br>
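<p>The machine tables above can be read as a small data structure: each machine takes a set of input compounds and produces a set of output compounds. The sketch below transcribes machines M6 through M26 in that form and asks which compounds become available when machines are run repeatedly. It is an illustration under stated assumptions: the starting compounds are assumed to be supplied by M5 and the energy machinery, "Cys" is an assumed abbreviation for cysteine, and quantities are ignored (a compound is either available or not).</p>

```python
# Each machine: (set of inputs, set of outputs), transcribed from the tables.
MACHINES = {
    "M6": ({"2OG"}, {"Glu", "Gln"}),
    "M7": ({"Glu", "ATP"}, {"Pro"}),
    "M8": ({"OXLA"}, {"Asp"}),
    "M9": ({"Glu", "Asp", "ATP"}, {"Arg"}),
    "M10": ({"Gln", "Asp", "ATP"}, {"Asn"}),
    "M11": ({"3PG", "Glu"}, {"Ser"}),
    "M12": ({"Ser"}, {"Gly"}),
    "M13": ({"Ser"}, {"Cys"}),  # "Cys" is an assumed abbreviation
    "M14": ({"HOM", "Cys"}, {"Met"}),
    "M15": ({"PYR", "Asp"}, {"Lys"}),
    "M16": ({"Asp"}, {"HOM"}),
    "M17": ({"HOM", "ATP"}, {"Thr"}),
    "M18": ({"Glu", "Thr", "PYR"}, {"Iso"}),
    "M19": ({"PYR"}, {"Ala"}),
    "M20": ({"PYR"}, {"Val"}),
    "M21": ({"PYR"}, {"Leu"}),
    "M22": ({"PEP", "E4P"}, {"CHOR"}),
    "M23": ({"Glu", "CHOR"}, {"Tyr", "Phe"}),
    "M24": ({"CHOR", "Gln"}, {"Trp"}),
    "M25": ({"G6P"}, {"R5P"}),
    "M26": ({"R5P", "ATP"}, {"His"}),
}

# Assumption: these compounds are already present in the cell.
START = {"2OG", "3PG", "OXLA", "PYR", "PEP", "E4P", "G6P", "ATP"}

AMINO_ACIDS = {
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Iso",
    "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val",
}


def buildable(start, machines):
    """Return every compound obtainable from `start` by repeatedly running machines."""
    have = set(start)
    progress = True
    while progress:
        progress = False
        for inputs, outputs in machines.values():
            if inputs.issubset(have) and not outputs.issubset(have):
                have |= outputs
                progress = True
    return have


# With all machines present, no amino acid is missing.
print(sorted(AMINO_ACIDS - buildable(START, MACHINES)))  # []

# Removing M16 (homoserine) also blocks everything downstream of it.
without_m16 = {name: m for name, m in MACHINES.items() if name != "M16"}
print(sorted(AMINO_ACIDS - buildable(START, without_m16)))  # ['Iso', 'Met', 'Thr']
```

<p>The second query shows why it helps to think in terms of machines: losing one machine removes not only its own product but every compound that depends on that product.</p>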
<h3>Expressing Genes</h3>
<img style="width: 374px; height: 430px;" alt="" src="./expression.jpg"><br>
<br>
<table style="text-align: left;" border="1" cellpadding="2" cellspacing="2">
<tbody>
<tr>
<td align="undefined" valign="undefined">M30</td>
<td align="undefined" valign="undefined">building
a protein from amino acids and a gene</td>
</tr>
</tbody>
</table>
<br>
<b>M30</b> is a complex machine that we have not represented all that
well.  It exists in the cell, and you might imagine the cell as
containing free-floating amino acids (which are built by the machines
discussed above).  <b>M30</b> can take the description of a protein
encoded in a gene and build the protein from the instructions and the
free-floating amino acids.  It is certainly a complex and incredible
machine, and it exists as a central component of the life forms we are
studying. 
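<p>At the level of strings, what <b>M30</b> computes can be sketched in a few lines: read the gene three characters (one codon) at a time and look up the amino acid for each codon until a stop codon is reached. The sketch below uses the standard genetic code. The names CODON_TABLE and translate are illustrative, and the real machine is of course far more elaborate than a table lookup.</p>

```python
BASES = "TCAG"
# Amino acids for the 64 codons in the order TTT, TTC, TTA, TTG, TCT, ...
# (standard genetic code); '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))


def translate(gene):
    """Translate a DNA string codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(gene) - 2, 3):
        amino_acid = CODON_TABLE[gene[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCTTGGTAA"))  # MAW
```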

<h3>Motility</h3>The cell we envision has some motility. &nbsp;It can
"turn on its motor and propellers" to move a bit, turn off the motility
machinery, wait a while, turn it on again, and so forth.<br>We do not show a diagram or table of this machine, but we shall number it <span style="font-weight: bold;">M31</span>.
<h3>Replication</h3>
<br>
Replication is described in a somewhat imprecise manner. &nbsp;We think of <span style="font-weight: bold;">M27</span> as a machine that builds the <span style="font-weight: bold;">nucleotides</span>, which are the characters that make up the DNA genome. &nbsp;Then <span style="font-weight: bold;">M28</span>
is a machine that takes these loose "characters" floating in the cell,
along with the existing genome, and manufactures a copy of the genome.
&nbsp;Finally, <span style="font-weight: bold;">M29</span> takes some extra membrane (see the output of <span style="font-weight: bold;">M5</span>),
along with the genome copy, and "pinches" the extended cell, creating two separate
cells, which we call the "original" (containing the original genome) and
the "daughter" (containing the copy of the genome).<br>
<img style="width: 503px; height: 674px;" alt="" src="./replication.jpg">
<br>
<br>
<table style="text-align: left;" border="1" cellpadding="2" cellspacing="2">
<tbody>
<tr>
<td align="undefined" valign="undefined">M27</td>
<td align="undefined" valign="undefined">build
nucleotides</td>
</tr>
<tr>
<td align="undefined" valign="undefined">M28</td>
<td align="undefined" valign="undefined">build
new genome</td>
</tr>
<tr>
<td align="undefined" valign="undefined">M29</td>
<td align="undefined" valign="undefined">split
the cell into original and daughter</td>
</tr>
</tbody>
</table><br>
<h2>Problems in Bioinformatics That Can Be Addressed Once the Notion of "Function" Exists</h2><br>The
inventory of machines has led us (albeit circuitously) into a
discussion of "the function of a protein" and how to think about it.
&nbsp;These problems relate to the use of comparative analysis between
the protein sequences from many distinct genomes (and what clues we can
expect to develop in our attempts to make sense of it all).<br><br>

<h4>Identifying the Functions of Genes</h4>The
general topic of how to assign function to genes is central to genome
annotation. &nbsp;Deciding when you can safely project function based
on similarity is a topic that can profitably be pondered.
<p>
Before leaving this topic, it is worth noting that a site called 
<a href="http://clearinghouse.nmpdr.org/aclh.cgi">The Annotation Clearinghouse</a>
exists.  This resource will allow users to download assertions of
function that are considered to be reasonably reliable by human
annotators manually curating the growing body of data.  The assertions
use widely differing IDs for genes (but a table for interconverting
the IDs is provided), they use an uncontrolled vocabulary (although
progress is being made in developing synonym lists), and many of the
assertions are undoubtedly wrong.  However, it is a start on a
resource of central importance.
<br><br>

<h4>Predicting When Two Genes Implement Related Functions</h4>There
are many clues that can be used to improve the accuracy of function
projection. &nbsp;Conservation of contiguity, detection of gene
fusions, protein-protein interaction data, and characterization of
regulatory sites have all proven useful. &nbsp;Integration of clues from
a number of sources has been attempted (and will undoubtedly be
important in the future).<br>
<p>
In our view, the most useful set of clues to date has arisen from
recognizing that genes that implement closely related functions (i.e.,
functions that are part of the same machine or machines that implement
connected functions) often occur close to one another in the genome.
That is, if you take the genes that implement a machine, and you look
at where these genes occur in the genome, the occurrences are not
random.  On average, about 50% of the genes that make up a machine will occur within
5000 characters of one another in the genome.  In some genomes far
fewer genes cluster (for reasons we do not fully understand).
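<p>To make "close to one another" concrete, the sketch below lists the pairs of genes in one genome that lie within 5000 characters of each other. The gene records (name, start, end), the example coordinates, and the function names are all hypothetical. Positions are treated as inclusive coordinates on a linear genome, which ignores the fact that many genomes are circular.</p>

```python
from itertools import combinations


def gap(a, b):
    """Number of characters separating two genes (0 if they touch or overlap)."""
    (_, a_start, a_end), (_, b_start, b_end) = a, b
    return max(0, max(a_start, b_start) - min(a_end, b_end) - 1)


def close_pairs(genes, max_gap=5000):
    """Return the names of all gene pairs separated by at most `max_gap` characters."""
    return [(a[0], b[0]) for a, b in combinations(genes, 2) if max_gap >= gap(a, b)]


# Hypothetical genes: (name, start, end)
genes = [
    ("geneA", 1000, 1900),
    ("geneB", 2100, 3000),
    ("geneC", 7500, 8300),
    ("geneD", 40000, 41000),
]
print(close_pairs(genes))  # [('geneA', 'geneB'), ('geneB', 'geneC')]
```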
<p>
To exploit this tendency, we might construct sets of pairs of genes.
All pairs in a set occur close together in a genome (one of the ones
in our collection).  All of the first members of pairs are similar to
one another, and all of the second members are similar to one another.
Because corresponding members of the pairs in a set are similar, one
might believe that all of the pairs implement the same two abstract
functions, but that is not always the
case.  It is often, and perhaps usually, the case, but there are many
instances where the pairs implement distinct functions.  For example,
there are many cases in which 4 close genes implement a transport
machine.  For each of these transport machines, even though they
transport completely different compounds, 3 of the 4 genes are pretty
similar.  The fourth gene is often the one that is specific to the
compound being transported.  
<p>
What we can say, assuming that we find enough entries in a set (that
is, far more corresponding pairs than one would expect by chance), is
that the functions of the genes in each pair are related.  We cannot
say with reliability that the actual functions in all of the pairs
match up, but the ones in each pair will usually be related.
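<p>The construction of these sets can be sketched as a grouping step. The example assumes two inputs that are not defined in this document: a list of close gene pairs tagged with the genome they come from, and a table assigning each gene to a family of similar genes. Requiring a minimum number of distinct genomes is only a crude stand-in for a real test of "more pairs than expected by chance".</p>

```python
from collections import defaultdict


def pair_sets(close_pairs, family_of, min_genomes=3):
    """Group close gene pairs by the families of their two members.

    Returns only the sets supported by at least `min_genomes` distinct genomes.
    """
    sets = defaultdict(list)
    for genome, g1, g2 in close_pairs:
        f1, f2 = family_of[g1], family_of[g2]
        if f1 == f2:
            continue  # simplification: ignore pairs of mutually similar genes
        if f1 > f2:
            # Order each pair so that first members all come from the same family.
            g1, g2, f1, f2 = g2, g1, f2, f1
        sets[(f1, f2)].append((genome, g1, g2))
    return {
        families: pairs
        for families, pairs in sets.items()
        if len({genome for genome, _, _ in pairs}) >= min_genomes
    }


# Hypothetical data: three genomes share an A-B neighborhood; A-C is seen once.
close = [
    ("genome1", "g1a", "g1b"),
    ("genome2", "g2b", "g2a"),
    ("genome3", "g3a", "g3b"),
    ("genome1", "g1a", "g1c"),
]
family_of = {
    "g1a": "famA", "g2a": "famA", "g3a": "famA",
    "g1b": "famB", "g2b": "famB", "g3b": "famB",
    "g1c": "famC",
}
result = pair_sets(close, family_of)
print(result)
# {('famA', 'famB'): [('genome1', 'g1a', 'g1b'), ('genome2', 'g2a', 'g2b'), ('genome3', 'g3a', 'g3b')]}
```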
<p>
Further, a single protein might well participate in pairs from several
sets.  By combining the evidence from all of these sets of pairs, it
is possible to produce an estimate of all of the components in a
machine, without really knowing the functions of any of them.  That
is, it becomes possible to say "I think that these four genes
implement a machine", and to do so without having a clear idea of what
the machine actually does.
<p>
The information produced by examining conserved contiguity has not
really been completely exploited.  It has proved to be immensely
useful, but there is far more to be gleaned from this data by those
with some minimal creativity and statistical competence.
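<p>One simple way to combine the evidence from many sets of pairs is to treat each "functionally related" pair as an edge in a graph and to report the connected groups of genes as candidate machines. This is a sketch of one possible approach, not a description of how any existing system does it, and the gene names are hypothetical.</p>

```python
def candidate_machines(related_pairs):
    """Group genes into connected components of the 'functionally related' graph."""
    neighbors = {}
    for a, b in related_pairs:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    seen, groups = set(), []
    for gene in neighbors:
        if gene in seen:
            continue
        stack, group = [gene], set()
        while stack:  # depth-first walk over everything linked to `gene`
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(neighbors[current] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


pairs = [("geneA", "geneB"), ("geneB", "geneC"), ("geneC", "geneD"), ("geneX", "geneY")]
print(candidate_machines(pairs))
# [['geneA', 'geneB', 'geneC', 'geneD'], ['geneX', 'geneY']]
```

<p>Note that the first group of four genes is proposed as a machine without any statement about what the machine does. A practical version would likely weight each edge by the strength of its evidence rather than treat all pairs equally.</p>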
<p>

<h4>Grouping Genes into Subsystems</h4>The genes that encode proteins that together implement a single machine may be thought of as an instance of a <span style="font-style: italic;">subsystem</span>.
&nbsp;In later tutorials we will discuss the notion of subsystem in
more detail. &nbsp;Essentially, it is an abstraction of the notion of
machine, and it represents an important conceptual framework for
analyzing the functions of genes from many genomes simultaneously.
&nbsp;So, how can you detect when two genes are components of the same
machine?<br><br>

<h4>Constructing Sets of Isofunctional Homologs</h4>Homologs
are genes that share a common ancestor. &nbsp;Isofunctional genes
implement the same function. &nbsp;The goal of compiling sets of
homologous genes (and the proteins they encode) that implement a single
function is central to automating annotation of genomes. &nbsp;Further,
since we will be faced with annotating thousands of new genomes over
the next few years (and the number will grow much more rapidly after that),
almost all annotations will have to be automated.<br><br>

<h4>Supporting Decision Procedures for Sets of Isofunctional Homologs</h4>Suppose
that you have a collection of sets of isofunctional homologs.
&nbsp;Suppose further that you have, say, 10,000 of these sets.
&nbsp;For each set, you will wish to develop a decision procedure
which, when given&nbsp;as input a set and a new protein sequence,
determines whether or not the protein should be added to the set.
&nbsp;In some cases, such decisions are easy, and you will wish to use
a very fast decision procedure. &nbsp;In others, they are very
difficult, and you will need to bring many sources of clues to
bear.<br>Construction of such decision procedures will become
increasingly important.<br><br>
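<p>As an illustration of a "very fast" decision procedure, the sketch below compares the short substrings (k-mers) of a new protein with those of each member of a set. It answers "add" or "reject" only when the evidence is clear and otherwise reports "undecided", which is where slower methods and additional clues would be brought in. The function names and both thresholds are arbitrary assumptions chosen for the example.</p>

```python
def kmers(seq, k=3):
    """Set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Shared fraction of two sets (0.0 when both are empty)."""
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0


def should_add(protein, members, k=3, accept=0.5, reject=0.1):
    """Return 'add', 'reject', or 'undecided' for a new protein and a set of members."""
    query = kmers(protein, k)
    best = max((jaccard(query, kmers(m, k)) for m in members), default=0.0)
    if best >= accept:
        return "add"
    if best > reject:
        return "undecided"
    return "reject"


members = ["MKTAYIAKQR", "MKTAYLAKQR"]
print(should_add("MKTAYIAKQR", members))  # add
print(should_add("GGGGGGGG", members))    # reject
```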

<h4>Characterization of Regulons for a Genome</h4>Genes
are often co-regulated. &nbsp;That is, expression of a set of genes may
always be tightly coordinated. &nbsp;In this case, we will think of the
co-regulated set as a <span style="font-weight: bold;">regulon</span>.
&nbsp;Determining which genes make up which regulons is a task
that requires both bioinformatic analysis and wet lab confirmation.
&nbsp;Don't attempt this one without a close working relationship with
a wet lab biologist.<br><br>

<h4>Characterization of "States of the Cell"</h4>It might be conjectured that a cell has a limited set of <span style="font-weight: bold;">states</span>.
&nbsp;Each state is characterized by the set of regulons that are
expressed. &nbsp;It seems likely that the cell should be viewed as
"tending to stay in the same state" until forced to make a transition
to another state. &nbsp;That is, the states demonstrate a degree of <span style="font-style: italic;">homeostasis.</span>
&nbsp;If we had a comprehensive list of states, and had worked out
the forces that determine transitions, we would begin to understand the
cell as a dynamic system.<br>

</body></html>
