Mon Oct 1 20:12:41 2007 UTC (12 years, 5 months ago) by overbeek

<div align=center>
<h1>The Role of Bioinformatics in Interpreting Genomes of Unicellular Organisms:</h1>
<br>
<h1>An Abstract View</h1>
<h2>by Ross Overbeek</h2>
</div>

<h2>What Is a Cell?</h2>

A <b>cell</b> is a bag (i.e., a volume enclosed by a membrane) that contains three types of things: compounds, cellular machines, and a genome.
<p>
By the term <b>compound</b> I refer to the normal notion of chemical compound. 
<p>

A <b>cellular machine</b> is a set of proteins that together perform a function.  This function is often to
transform a set of compounds into another set.  Some types of machines (transport machines)
are used to move compounds into
or out of the cell.
<p>

A <b>protein</b> is a string of amino acids (i.e., a string in the  20-character alphabet {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}).
<p>

A <b>genome</b> is a string of DNA bases (i.e., a string in the 4-character alphabet {A,C,G,T}).
<p>

A <b>gene</b> is a region in the genome that describes how to build a
protein.  The description is a sequence of 3-character codons.  Each
codon either corresponds to a single amino acid or is a stop codon.
There are three stop codons: {TAA,TAG,TGA}.  The genetic code is the
table of correspondences between codons and amino acids:
<br><br>
<table border>
<tr><th>Amino Acid</th><th>Codons</th></tr>
<tr><td>A</td> <td>GCT, GCC, GCA, GCG </td></tr>
<tr><td>C</td> <td>TGT, TGC</td></tr>
<tr><td>D</td> <td>GAT, GAC</td></tr>
<tr><td>E</td> <td>GAA, GAG</td></tr>
<tr><td>F</td> <td>TTT, TTC</td></tr>
<tr><td>G</td> <td>GGT, GGC, GGA, GGG</td></tr>
<tr><td>H</td> <td>CAT, CAC</td></tr>
<tr><td>I</td> <td>ATT, ATC, ATA</td></tr>
<tr><td>K</td> <td>AAA, AAG</td></tr>
<tr><td>L</td> <td>TTA, TTG, CTT, CTC, CTA, CTG</td></tr>
<tr><td>M</td> <td>ATG</td></tr>
<tr><td>N</td> <td>AAT, AAC</td></tr>
<tr><td>P</td> <td>CCT, CCC, CCA, CCG</td></tr>
<tr><td>Q</td> <td>CAA, CAG</td></tr>
<tr><td>R</td> <td>CGT, CGC, CGA, CGG, AGA, AGG</td></tr>
<tr><td>S</td> <td>TCT, TCC, TCA, TCG, AGT, AGC</td></tr>
<tr><td>T</td> <td>ACT, ACC, ACA, ACG</td></tr>
<tr><td>V</td> <td>GTT, GTC, GTA, GTG</td></tr>
<tr><td>W</td> <td>TGG</td></tr>
<tr><td>Y</td> <td>TAT, TAC</td></tr>
<tr><td>*</td> <td>TAG, TGA, TAA  [Stop codons]</td></tr>
</table>
<br><br>
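<p>
As a minimal sketch of how the table is used, the following Python reads a gene three characters at a time and looks up each codon.  The names <i>GENETIC_CODE</i> and <i>translate</i> are made up for this illustration, and the sketch assumes the input contains only the characters A, C, G, and T and is read starting at its first character.

```python
# Genetic code from the table above: amino acid -> codons ("*" marks stop codons).
CODONS_BY_AMINO_ACID = {
    "A": "GCT GCC GCA GCG", "C": "TGT TGC", "D": "GAT GAC", "E": "GAA GAG",
    "F": "TTT TTC", "G": "GGT GGC GGA GGG", "H": "CAT CAC", "I": "ATT ATC ATA",
    "K": "AAA AAG", "L": "TTA TTG CTT CTC CTA CTG", "M": "ATG", "N": "AAT AAC",
    "P": "CCT CCC CCA CCG", "Q": "CAA CAG", "R": "CGT CGC CGA CGG AGA AGG",
    "S": "TCT TCC TCA TCG AGT AGC", "T": "ACT ACC ACA ACG",
    "V": "GTT GTC GTA GTG", "W": "TGG", "Y": "TAT TAC", "*": "TAG TGA TAA",
}

# Invert the table so that each codon maps to its amino acid.
GENETIC_CODE = {
    codon: amino_acid
    for amino_acid, codons in CODONS_BY_AMINO_ACID.items()
    for codon in codons.split()
}


def translate(gene):
    """Translate a DNA string codon by codon, stopping at the first stop codon."""
    protein = []
    for start in range(0, len(gene) - 2, 3):
        amino_acid = GENETIC_CODE[gene[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

Inverting the table once means each codon lookup is a single dictionary access.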
<hr>
This minimal notion of a cell is enough to explain some of the central
problems in bioinformatics:

<h3>Identify the genes within a genome</h3>

This problem simply involves taking a genome (a string of DNA) and locating
the set of genes it contains.  Does the existence of hundreds of genomes (each
with at least some estimate of where its genes occur) affect how you might do this?
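<p>
A very crude first pass at this problem can be sketched as a scan for open reading frames.  The sketch below makes several assumptions that are not part of the discussion above: it treats ATG as the only start codon, it looks at one strand only, and the function name <i>find_orfs</i> and the <i>min_codons</i> threshold are invented for illustration.  Real gene finders use far more evidence than this.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(genome, min_codons=100):
    """Return (start, end) pairs for open reading frames on the given strand.

    An open reading frame here runs from an ATG to the first in-frame stop
    codon; `end` is the index just past the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for pos in range(frame, len(genome) - 2, 3):
            codon = genome[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                # Keep the frame only if it is long enough to be plausible.
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None
    return sorted(orfs)
```

The length threshold matters because short open reading frames occur by chance in any long string of DNA.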

<h3>Given two proteins, "align" them in a way that minimizes some edit function.</h3>

For example:
<br>
<br>
<pre>

seq1            -------------------MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLT
seq2            MKIRVICGAPTPKPFIKIPMKYYSTNKQAPLASLEEAVVKGLASDKGLFMPMTIKPLPQE
                                   ** *. :.:   .*: :**.:**..::***:*  :  :.

seq1            EIDEMLKLDFVTRSAKILSAFIGDEIPQEILEERVRAAFAFPAPVANVESDVGCLELFHG
seq2            FYDEIENLSFREIAYRVADAFFGEDVPAETLKEIVYDTLNFDVPLVPVKENIYSLELFHG
                  **: :*.*   : :: .**:*:::* * *:* *  :: * .*:. *:.:: .******

seq1            PTLAFKDFGGRFMAQMLTHIA---GDKPVTILTATSGDTGAAVAHAFYGLPNVKVVILYP
seq2            PTLAFKDVGGRFMARLLGYFIRKEGRKQVNVLVATSGDTGSAVANGFLGVEGIHVYVLYP
                *******.******::* ::    * * *.:*.*******:***:.* *: .::* :***

seq1            RGKISPLQEKLFCTLGGNIETVAIDGDFDACQALVKQAFDDEELKVALGLNSANSINISR
seq2            KGKVSEIQEKQFTTLGRNITALEVDGTFDDCQALVKAAFMDQELNEQLLLTSANSINVAR
                :**:* :*** * *** ** :: :** ** ****** ** *:**:  * *.******::*

seq1            LLAQICYYFEAVAQLPQETRNQ-LVVSVPSGNFGDLTAGLLAKSLGLPVKRFIAATNVND
seq2            FLPQAFYYFYAYAQLKKAGRAENVVICVPSGNFGNITAGLFGKKMGLPVRRFIAANNKND
                :*.*  *** * *** :  * : :*:.*******::****:.*.:****:*****.* **
</pre>

shows an alignment of two proteins (called <i>seq1</i> and <i>seq2</i>).
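<p>
To make "minimizes some edit function" concrete, here is a sketch of global alignment by dynamic programming.  It assumes the simplest possible edit function (a cost of 1 for each mismatch and each gap character); practical tools score amino acid substitutions and gaps in more refined ways, so this is an illustration of the idea rather than the method used to produce the alignment above.  The function name <i>align</i> and its parameters are hypothetical.

```python
def align(a, b, mismatch=1, gap=1):
    """Globally align two strings, minimizing total mismatch and gap cost."""
    rows, cols = len(a) + 1, len(b) + 1
    # cost[i][j] is the minimum cost of aligning a[:i] with b[:j].
    cost = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i * gap
    for j in range(1, cols):
        cost[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if a[i - 1] == b[j - 1] else mismatch
            cost[i][j] = min(
                cost[i - 1][j - 1] + substitution,  # a[i-1] paired with b[j-1]
                cost[i - 1][j] + gap,               # a[i-1] paired with a gap
                cost[i][j - 1] + gap,               # b[j-1] paired with a gap
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        both = i > 0 and j > 0
        substitution = 0 if both and a[i - 1] == b[j - 1] else mismatch
        if both and cost[i][j] == cost[i - 1][j - 1] + substitution:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return cost[len(a)][len(b)], "".join(reversed(top)), "".join(reversed(bottom))
```

The table has one entry per pair of prefixes, so the running time grows with the product of the two sequence lengths.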

<h3> Given a set of sequences, align them in a way that minimizes some edit function.</h3>

Here is an example of a multiple sequence alignment:
<br>
<br>
<pre>
CLUSTAL W (1.83) multiple sequence alignment


seq3            -------------------MRYISTRGQAPALNFEDVLLAGLASDGGLYVPENLPRFTLE
seq4            -------------------MRYISTRGSAPTLSFEEVLLTGLASDGGLYVPESLPSFTSA
seq5            -------------------MNYISTRGAIAPIGFKDAVMMGLATDGGLLLPETIPALGRN
seq1            -------------------MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLT
seq2            MKIRVICGAPTPKPFIKIPMKYYSTNKQAPLASLEEAVVKGLASDKGLFMPMTIKPLPQE
                                   *.  . .      .: :.:  **..: ** .*  :  :

seq3            EIASWVGLPYHELAFRVMRPFVAGSIADADFKKILEETYGVFAHDAVAPLRQLNGNEWVL
seq4            ELEAMASLDYPSLAHRILLPFVEEAFTGEELREIIDDTYAVFRHSAVAPLVQLDHNQWVL
seq5            TLESWQSLSYQDLAFNVIS-LFADDIPAQDLKDLIDRSYATFSHPEITPVVEKDG-VYIL
seq1            EIDEMLKLDFVTRSAKILSAFIGDEIPQEILEERVRAAFAFP-----APVANVESDVGCL
seq2            FYDEIENLSFREIAYRVADAFFGEDVPAETLKEIVYDTLNFD-----VPLVPVKENIYSL
                       * :   : .:   :.   ..   :.. :  :         .*:   .     *

seq3            ELFHGPTLAFKDFALQLLGRLLDHVLAKRGER-VVIMGATSGDTGSAAIEGCRRCDNVDI
seq4            ELFHGPTLAFKDFALQLLGRLLDAILKRRGEK-VVIMGATSGDTGSAAIAGCERCENIDI
seq5            ELFHGPTLAFKDVALQLLGNLFEYLLKERGEK-MNIVGATSGDTGSAAIYGVRGKDKINI
seq1            ELFHGPTLAFKDFGGRFMAQMLTHIA---GDKPVTILTATSGDTGAAVAHAFYGLPNVKV
seq2            ELFHGPTLAFKDVGGRFMARLLGYFIRKEGRKQVNVLVATSGDTGSAVANGFLGVEGIHV
                ************.. :::..::  .    * : : :: *******:*.  .      :.:

seq3            FIMHPHNRVSEVQRRQMTTILGDNIHNIAIEGNFDDCQEMVKASFADQGFLK-GTRLVAV
seq4            FILHPHGRVSEVQRRQMTTLSAPTIHNLAIEGNFDDCQAMVKASFRDQSFLPDGRRLVAV
seq5            FILHPHGKTSPVQALQMTTVLDPNVHNIAARGTFDDCQNIVKSLFSDLPFKE-KYSLGAV
seq1            VILYPRGKISPLQEKLFCTLGG-NIETVAIDGDFDACQALVKQAFDDEELKV-ALGLNSA
seq2            YVLYPKGKVSEIQEKQFTTLGR-NITALEVDGTFDDCQALVKAAFMDQELNE-QLLLTSA
                 :::*:.: * :*   : *:   .:  :   * ** ** :**  * *  :      * :.

seq3            NSINWARIMAQIVYYFHAALQLG-APH-RSVAFSVPTGNFGDIFAGYLARNMGLPVSQLI
seq4            NSINWARIMAQIVYYFYAGLRLG-APH-RAAAYSVPTGNFGDIFAGYLASKMGLPVAQLM
seq5            NSINWARVLAQVVYYFYAYFRVA-ALFGQEVVFSVPTGNFGDIFAGYVAKRMGLPIRRLI
seq1            NSINISRLLAQICYYFEAVAQLPQETRNQ-LVVSVPSGNFGDLTAGLLAKSLGLPVKRFI
seq2            NSINVARFLPQAFYYFYAYAQLKKAGRAENVVICVPSGNFGNITAGLFGKKMGLPVRRFI
                **** :*.:.*  *** *  ::      .  . .**:****:: ** ..  :***: :::

seq3            VATNRNDILHRFMSGNRYDKDTLHPSLSPSMDIMVSSNFERLLFDLHGRNGKAVAELLDA
seq4            IATNRNDVLHRLLSTGDYARQTLEHTLSPSMDISVSSNFERLMFDLYERDGAAIASLMAA
seq5            LATNENNILSRFINGGDYSLGDVVATVSPSMDIQLASNFERYVYYLFGENPARVREAFAA
seq1            AATNVNDTVPRFLHDGQWSPKATQATLSNAMDVSQPNNWPR-VEELFR------------
seq2            AANNKNDIFYQYLQTGQYNPRPSVATIANAMDVGDPSNFAR-VLDLYGGS----------
                 *.* *: . : :  . :       ::: :**:  ..*: * :  *.

seq3            FKASGKLSVEDQRWTEARKLFDSLAVSDEQTCETIAEVYRSCGELLDPHTAIGVRAAREC
seq4            FDD-GDITLSDAAMEKARQLFASHRVDDAQTLACIADVWGRTEYLLDPHSAIGYAAATQP
seq5            LPTKGRIDFTEAEMEKVRDEFLSRSVNEDETIATIAAFHRETGYILDPHTAVGVKAALEL
seq1            -------------RKIWQLKELGYAAVDDETTQQTMRELKELGYTSEPHAAVAYRALRDQ
seq2            -------------HAAIAAEISGTTYTDEQIRESVKACWQQTGYLLDPHGACGYRALEEG
                                      .    : :                :** * .  *  :

seq3            RRSLSVPMVTLGTAHPVKFPEAVEKAGIGQAPALPAHLADLFEREERCTVLPNELAKVQA
seq4            GANTQTPWVTLATAHPAKFPDAIKASAVGTTAQLPVHLADLFERSEHFDVLPNDIAAVQR
seq5            VQDG-TPAVCLATAHPAKFAEAVVR-AVGFEPSRPTSLEGIEALPSRCDVLDADRDAIKA
seq1            LNPG-EYGLFLGTAHPAKFKESVEA-ILGETLDLPKELAERADLPLLSHNLPADFAALRK
seq2            LQPG-ETGVFLETAHPAKFLQTVES-IIGTEVEIPAKLRAFMKGEKKSLPMTKEFADFKS
                        : * ****.** :::    :*     *  *            :  :   .:

seq3            FVSQHGNRGKPL
seq4            FMSGHLGA----
seq5            FIEKKAL-----
seq1            LMMNHQ------
seq2            YLLGK-------
                 :  :
</pre>

<h3> Given a multiple sequence alignment, determine the most likely evolutionary history of the sequences (i.e., construct a phylogenetic tree).</h3>

Here is one reasonable tree for the last 5 sequences.  Note that we now have alignments that
contain thousands of sequences, and even displaying such trees is nontrivial.
<pre>
                     ,--------------------------------------------------- seq1
                     |
                     |
  ,------------------|
  |                  |
  |                  |
  |                  `---------------------------------------------- seq2
  |
  |
  |
  |
  |
  |             ,-------------------------------- seq3
  |             |
  |             |
  |-------------|
  |             |
  |             |
  |             `------------------------------ seq4
  |
  |
  `---------------------------------------------- seq5
</pre>

This is an <i>unrooted tree</i>, since, looking only at extant
sequences, we have no idea where the root should lie.
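<p>
One family of tree-building methods starts from pairwise distances computed from the alignment.  The sketch below computes a simple distance (the fraction of differing characters among columns where neither sequence has a gap).  It is only a possible first step, not a tree-building algorithm, and nothing here says how the tree above was produced; the names <i>p_distance</i> and <i>distance_matrix</i> are made up for this example.

```python
def p_distance(row_a, row_b):
    """Fraction of differing positions among columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(alignment):
    """Map every ordered pair of sequence names to its p-distance.

    `alignment` maps a sequence name to its aligned row; all rows are
    assumed to have the same length.
    """
    names = sorted(alignment)
    return {
        (a, b): p_distance(alignment[a], alignment[b])
        for a in names
        for b in names
    }


# The final block of the multiple alignment shown earlier, for two sequences.
last_block = {"seq3": "FVSQHGNRGKPL", "seq4": "FMSGHLGA----"}
distances = distance_matrix(last_block)
```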

<h2>Some Random Facts that You Should Absorb</h2>

Most genomes of bacteria contain between 400,000 and 12,000,000 characters.  
Normally, the genes in a genome
cover about 90% of the genome.
Normally, there is about one gene per 1000 characters in a bacterial genome.
<p>
So, 
<ul>
<li> What is the length of the average protein sequence?  
<li>How many genes do these
genomes have?  
<li>What is the average length of a gene?
</ul>
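<p>
One way to work through these questions is to turn the three facts above into arithmetic.  The sketch below is a back-of-the-envelope estimate only; the function name <i>estimate</i> is invented, and it ignores details such as the stop codon at the end of each gene.

```python
def estimate(genome_length, coverage_percent=90, chars_per_gene=1000):
    """Rough estimates derived from the rules of thumb stated above.

    Returns (number of genes, average gene length in DNA characters,
    average protein length in amino acids).
    """
    # About one gene per 1000 characters of genome.
    gene_count = genome_length // chars_per_gene
    # Genes cover about 90% of the genome, so each 1000-character stretch
    # holds about 900 characters of gene.
    avg_gene_length = chars_per_gene * coverage_percent // 100
    # Each amino acid is described by one 3-character codon.
    avg_protein_length = avg_gene_length // 3
    return gene_count, avg_gene_length, avg_protein_length


smallest = estimate(400_000)
largest = estimate(12_000_000)
```

Under these assumptions the gene count scales with genome size, while the average gene and protein lengths do not depend on it.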
<br>
It is worth spending just a short bit of time thinking about what types of cellular
machines must exist.  Here are a few thoughts to start with:
<ul>
<li>
There must be one or more machines that support replication of the cell.  You would
need something to copy the genome, and you would need something that could build the DNA
bases that represent the characters (i.e., you will need machines to build the molecules
corresponding to each of the four characters in the alphabet of DNA bases).
<li>
As we mentioned, you have transport machines that take things into and out of the cell.  Many
cells can import food in the form of sugar molecules.  For example, many cells can import
<i>glucose</i>, a six-carbon compound.  As the compound gets broken down into smaller compounds,
energy is salvaged from the broken bonds to power the machines in the cell.  The smaller compounds
are used as building blocks for other needs.
<li>
There must be one or more machines involved in building proteins from the descriptions in the genes.
In particular, we will need a machine for each of the amino acids (unless the cell can import some
of them).
<li>
There must be mechanisms for sensing what is going on in the environment and allowing the cell
to react to it.  For example, many cells can "swim" towards food.
</ul>
Those were just a few examples.  For any cell, we have many, many machines, and we still
do not even understand what some of them do.
<p>
About 50-60% of the genes occur within 5000 characters of another gene such that
the two genes encode proteins that are part of the same cellular machine.  If you
had a genome in which the genes were identified, but the correspondence between the encoded
proteins and cellular machines was completely unknown, what could you learn using this fact?
Is the situation significantly different if you have 1000 genomes (let us say that
you know where the genes occur, but the correspondence between the proteins and cellular machines
is completely unknown in each case)?
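<p>
A starting point for exploring this question is simply to list the pairs of genes that lie close together.  The sketch below assumes that "within 5000 characters" means that the stretch of genome separating the two genes is at most 5000 characters long; the <i>Gene</i> record and the function names are hypothetical.

```python
from itertools import combinations
from typing import NamedTuple


class Gene(NamedTuple):
    gene_id: str
    start: int  # index of the first character of the gene
    end: int    # index just past the last character of the gene


def gap_between(g1, g2):
    """Number of characters separating two genes (0 if they overlap)."""
    return max(0, max(g1.start, g2.start) - min(g1.end, g2.end))


def nearby_pairs(genes, max_gap=5000):
    """Return pairs of gene IDs whose genes lie within max_gap characters."""
    ordered = sorted(genes, key=lambda g: g.start)
    return [
        (g1.gene_id, g2.gene_id)
        for g1, g2 in combinations(ordered, 2)
        if gap_between(g1, g2) <= max_gap
    ]
```

Each pair returned is a candidate for "same cellular machine", not a confirmed one.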
<p>
Occasionally, proteins that are usually distinct in most cells are fused into a single protein in
a few cells.  In these cases, the fused gene is (by definition) part of a single machine, and
in most cells in which the proteins are not fused, the two distinct proteins are separate components
of a single machine.  How would you go about locating fused genes, and what could you learn from them?
<p>
Biologists have figured out the roles of about 50% of the genes.  That is, they can
place the gene in a cellular machine, they know what the machine does, and they know
the specific role of the gene in sustaining the functionality of the machine.
<br><br>

<h2>Imposing a Structure on Characterizing the Inventory</h2>

One central goal of bioinformatics is to support an accurate characterization of the cellular
machinery for each cell.  It is of major importance to biologists that we be able to support
comparative analysis of cells.  Perhaps the most important aspect of understanding cells relates to
their origin in an evolutionary process.  Cells have a long evolutionary history dating back billions of
years.  The machines we see in cells today arose in the past, so we expect to see many current cells
using machinery that resembles what turns up in other cells.  When we compare machines from different
cells they often look remarkably similar.  On the other hand, those that had a common origin in a cell that existed 
billions of years in the past may now have versions that are not very similar.  Modifications, optimizations,
and insignificant alterations all combine to explore the space of operational possibilities for
each type of machine.  Hence, we need a framework for studying similarities and differences in the
cellular machines and the proteins that implement them.
<p>

Here is a short formulation of one way to do this:
<br><br>
<ul>
<li>A <b>subsystem</b> (i.e., an abstract cellular machine) is a set of functional roles.
<li>Each protein implements one or more functional roles.  The set of functional roles
implemented by the protein is called the <b>function of the protein</b>.  The function of a  multifunctional
protein that implements {functional-role-1,functional-role-2} is normally written as
<i>functional-role-1 / functional-role-2</i>.
<br><br>
<li>A <b>populated subsystem</b> is a subsystem with an attached spreadsheet.  Each column
in the spreadsheet corresponds to a functional role in the subsystem, and each row corresponds to
a specific genome.  Each cell in the spreadsheet contains the genes from the corresponding genome
that implement the designated functional role (there may be 0 or more such genes).
</ul>
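<p>
The definitions above map naturally onto a small data structure.  The following is a sketch under stated assumptions, not a description of any existing database: the class <i>PopulatedSubsystem</i>, its methods, and the helper <i>roles_of</i> are all invented for illustration, and <i>roles_of</i> assumes that role names never themselves contain " / ".

```python
from dataclasses import dataclass, field


def roles_of(function):
    """Split a function written as 'role-1 / role-2' into its set of roles."""
    return {role.strip() for role in function.split(" / ")}


@dataclass
class PopulatedSubsystem:
    name: str
    roles: list  # column headings: the functional roles of the subsystem
    rows: dict = field(default_factory=dict)  # genome ID -> {role -> gene IDs}

    def add_gene(self, genome_id, role, gene_id):
        """Record that a gene from a genome implements a functional role."""
        if role not in self.roles:
            raise ValueError(f"{role!r} is not a role of {self.name}")
        row = self.rows.setdefault(genome_id, {r: set() for r in self.roles})
        row[role].add(gene_id)

    def cell(self, genome_id, role):
        """Genes from one genome that implement one role (possibly none)."""
        return self.rows.get(genome_id, {}).get(role, set())

    def missing_roles(self, genome_id):
        """Roles with no gene assigned in the given genome."""
        return [r for r in self.roles if not self.cell(genome_id, r)]
```

A row with missing roles is a place to look more closely: either the genome lacks that part of the machine, or the gene that implements the role has not yet been recognized.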
<br><br>
We do not actually know what machines are present in a cell.  We are in the midst of a grand
effort to clarify which are there and what they do.  The formulation of subsystems as abstract machines,
in which each row of the populated subsystem describes a specific cellular machine that is believed to be present,
represents a way to maintain a collection of estimates or assertions.
<p>
A <b>protein family</b> is defined to be a set of proteins that implement the same functional roles and
are similar over the entire lengths of the proteins.
<p>
We seek a situation in which each protein occurs in one or more subsystems and in a single protein family.
The computational tasks imposed by such a goal are obvious:
<ul>
<li>We need to construct databases that implement at least the following entities:
<ol>
<li>cells (i.e., each cell must have an ID and a set of attributes),
<li>genomes,
<li>genes,
<li>proteins,
<li>functional roles,
<li>subsystems, and
<li>protein families.
</ol>
<li> We need to add support for developing clues to function by integrating data
from sources like proximity within the genome, fusions, etc.
<li>We need to support a framework for the development of populated subsystems.
<li>We need to construct decision procedures for membership in protein families.  Some 
of these procedures will be quite complex, although the majority of cases can be
handled by fairly general procedures.
</ul>

<h2>Microarrays and States of the Cell</h2>

We will think of a <b>regulon</b> as a set of subsystems.  A <b>state of the cell</b> is
defined as the set of regulons that are operational at a point in time.
<p>
A <b>consistent microarray</b> (for our purposes) consists of
<ol>
<li> The ID of an experiment.  The experiment corresponds to two states of the cell, S1 and S2.
<li> A list of proteins that are in the regulons in S1, but not in those of S2.
<li> A list of proteins that are in the regulons of S2, but not in those of S1.
</ol>
<br>
A <b>real microarray</b> is just two sets of proteins.  We have some notion (e.g., an ID) for
each of two states of the cell, but no idea what regulons make up these states.  There is a
substantial error rate in the two lists of proteins (e.g., some of the proteins in the first list either were not in
S1 or they were in S2).
<p>
The interesting research question is: given a large list of real microarrays, can you
attach sets of regulons to all of the state IDs, and then give a minimal set of changes to the data
in the real microarrays needed to convert them to consistent microarrays?
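<p>
The forward direction of this problem is easy to sketch: if the regulons in each state were known, the consistent microarray would follow from two set differences.  The code below assumes each regulon has already been flattened from a set of subsystems into the set of proteins it involves; the mapping <i>regulon_proteins</i> and all function names are hypothetical.  The research question itself runs in the other direction and is not solved here.

```python
def proteins_in_state(state, regulon_proteins):
    """Union of the proteins of every regulon that is operational in a state."""
    proteins = set()
    for regulon in state:
        proteins |= regulon_proteins[regulon]
    return proteins


def consistent_microarray(experiment_id, s1, s2, regulon_proteins):
    """Build (ID, proteins only in S1, proteins only in S2) from two states."""
    in_s1 = proteins_in_state(s1, regulon_proteins)
    in_s2 = proteins_in_state(s2, regulon_proteins)
    return experiment_id, in_s1 - in_s2, in_s2 - in_s1


def changes_needed(real, consistent):
    """Count the protein additions and removals that turn the two lists of a
    real microarray into the two lists of a consistent one."""
    real_1, real_2 = real
    _, cons_1, cons_2 = consistent
    return len(set(real_1) ^ cons_1) + len(set(real_2) ^ cons_2)
```

For a candidate assignment of regulons to state IDs, summing <i>changes_needed</i> over all experiments gives one possible measure of how well the assignment explains the data.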

