Diff of /FigTutorial/tut_abs.html

revision 1.4, Thu Nov 15 21:06:50 2007 UTC revision 1.5, Tue Feb 12 20:25:05 2008 UTC
# Line 1  Line 1 
1  <div align=center>  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
2  <h1>The Role of Bioinformatics in Interpreting Genomes of Unicellular Organisms:</h1>  <html><head><title>Abstraction Working Document</title>
3  <br>  
4    </head>
5    <body>
6    <div align="center">
7    <h1>The Role of Bioinformatics in Interpreting Genomes of
8    Unicellular Organisms:</h1>
9  <h1>An Abstract View</h1>  <h1>An Abstract View</h1>
10  <h2>by Ross Overbeek</h2>  <h2>by Ross Overbeek, ...</h2>
11  </div>  </div>
12  <h2>Introduction</h2>  <h2>Introduction</h2>
13  This strange document began as a tutorial for computer scientists and mathematicians.  It was supposed  This strange document began as a tutorial for computer scientists and
14  to somehow introduce them to the computational issues in genome analysis.  mathematicians. It was supposed
15  It was requested by an instructor in a computer class.  Overbeek in attempting to respond to this request  to somehow introduce them to the computational issues in genome
16  formulated an abstraction that he began to believe had significance beyond the tutorial.  analysis.
17  <p>  It was requested by an instructor in a computer class. Overbeek in
18  This document is a set of working notes relating to the abstract.  It is not organized properly as  attempting to respond to this request
19  an abstraction, a tutorial, or an essay on the role of bioinformatics in support of biological research.  It is,  formulated an abstraction that he began to believe had significance
20  however, organized properly as a working document that relates to all of these goals.  beyond the tutorial.
21  <p>  <p>This document is a set of working notes relating to the
22  It begins with a development of the abstraction.  This will be suitable for mathematicians or computer scientists.  abstract. It is not organized properly as
23  The abstraction is developed in three steps: the basic abstraction, the enhanced abstraction needed to support  an abstraction, a tutorial, or an essay on the role of bioinformatics
24  basic bioinformatics support for biologists, and finally the third step which includes support for the notion  in support of biological research. It is,
25  of regulation.  The intent throughout this discussion will be to seek a minimal set of concepts needed to  however, organized properly as a working document that relates to all
26  effectively capture the essence of the required data.  Unlike almost all efforts to lay a foundation  of these goals.
27  for tutorials, software or research in biology, this effort focuses on leaving out as much as possible.  </p>
28  While we do believe that there is an almost unlimited complexity that can be introduced, and almost all of  <p>It begins with a development of the abstraction. This will be
29  it is needed for some specific goals, the vast majority of tools and discussions require (we believe) relatively few  suitable for mathematicians or computer scientists.
30    The abstraction is developed in three steps: the basic abstraction, the
31    enhanced abstraction needed to support
32    basic bioinformatics support for biologists, and finally the third step
33    which includes support for the notion
34    of regulation. The intent throughout this discussion will be to seek a
35    minimal set of concepts needed to
36    effectively capture the essence of the required data. Unlike almost all
37    efforts to lay a foundation
38    for tutorials, software or research in biology, this effort focuses on
39    leaving out as much as possible.
40    While we do believe that there is an almost unlimited complexity that
41    can be introduced, and almost all of
42    it is needed for some specific goals, the vast majority of tools and
43    discussions require (we believe) relatively few
44  concepts.  As they say, "the proof is in the pudding."  concepts.  As they say, "the proof is in the pudding."
45    </p>
46  <p>  <p>The second section will feature a bit more tutorial comments.
47  The second section will feature a bit more tutorial comments.  It may well repeat much of what is in Part 1.  It may well repeat much of what is in Part 1.
48  This part is offered as a way of easing a computer scientist or mathematician into the issues that need to be  This part is offered as a way of easing a computer scientist or
49  considered, if they wish to try to do useful research relating to the genomics revolution.  Eventually, this part  mathematician into the issues that need to be
50  will be dramatically expanded by giving condensed summaries of the machines of the cell broken into two broad  considered, if they wish to try to do useful research relating to the
51  sets: the metabolic network and the cellular machinery not directly included in the metabolic network.  Loosely,  genomics revolution. Eventually, this part
52  this separates what would be learned in a microbial biochemistry class (when they exist) from what would  will be dramatically expanded by giving condensed summaries of the
53    machines of the cell broken into two broad
54    sets: the metabolic network and the cellular machinery not directly
55    included in the metabolic network. Loosely,
56    this separates what would be learned in a microbial biochemistry class
57    (when they exist) from what would
58  be learned in a course on molecular biology.  be learned in a course on molecular biology.
59  <p>  </p>
60  The third part is an essay that attempts to characterize our view on  <p>The third part is an essay that attempts to characterize our
61    view on </p>
62  <ul>  <ul>
63  <li> what the main goals should be in current efforts to advance biological knowledge via genome research,  <li> what the main goals should be in current efforts to
64  <li> what role bioinformatics researchers have played in the past, and  advance biological knowledge via genome research,
65  <li> what role they could productively play during the coming few years.  </li>
66    <li> what role bioinformatics researchers have played in the
67    past, and
68    </li>
69    <li> what role they could productively play during the coming
70    few years.
71    </li>
72  </ul>  </ul>
73  As such, it is undoubtedly an arrogant formulation by a group of individuals with minimal background in  As such, it is undoubtedly an arrogant formulation by a group of
74    individuals with minimal background in
75  biology.  biology.
76  <p>  <p>The fourth section will focus on the implications of the
77  The fourth section will focus on the implications of the abstractions in software development.  abstractions in software development.
78  This is a bit of a radical proposal that makes sense to us (and is in an area that we can  This is a bit of a radical proposal that makes sense to us (and is in
79    an area that we can
80  legitimately claim expertise).  legitimately claim expertise).
81    </p>
82  <h1>Part 1: The Abstractions</h1>  <h1>Part 1: The Abstractions</h1>
83  <h2>The cell: a Minimal Perspective</h2>  <h2>The cell: a Minimal Perspective</h2>
84    A <b>cell</b> is a bag (i.e., a volume enclosed by a
85  A <b>cell</b> is a bag (i.e., a volume enclosed by a membrane) that contains three types of things: compounds, cellular machines, and a genome.  membrane) that contains three types of things: compounds, cellular
86  <p>  machines, and a genome.
87  By the term <b>compound</b> we refer to the normal notion of chemical compound.  <p>By the term <b>compound</b> we refer to the
88  <p>  normal notion of chemical compound. </p>
89    <p>A <b>cellular machine</b> is a set of proteins
90  A <b>cellular machine</b> is a set of proteins that together perform a function. Unless otherwise noted,  that together perform a function. Unless otherwise noted,
91  when we use the term <i>machine</i> we will always be speaking of a cellular machine.  when we use the term <i>machine</i> we will always be
92    speaking of a cellular machine.
93  Many machines  Many machines
94  transform one set of compounds into another set.  Some machines (transport machines)  transform one set of compounds into another set. Some machines
95  are used to move compounds into  (transport machines) are used to move compounds into
96  or out of the cell.  Later we will try to convey a more comprehensive notion of what functions are implemented  or out of the cell. Later we will try to convey a more comprehensive
97    notion of what functions are implemented
98  by machines that we understand.  by machines that we understand.
99  <p>  </p>
100    <p>A <b>protein</b> is a string of amino acids
101  A <b>protein</b> is a string of amino acids (i.e., a string in the  20-character alphabet {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}).  (i.e., a string in the 20-character alphabet
102  <p>  {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}).
103    </p>
104  A <b>genome</b> is a string of DNA bases (i.e., a string in the 4-character alphabet {A,C,G,T}).  <p>A <b>genome</b> is a string of DNA bases (i.e., a
105  <p>  string in the 4-character alphabet {A,C,G,T}).
106    </p>
107  A <b>gene</b> is a region in the genome that describes how to build a  <p>A <b>gene</b> is a region in the genome that
108    describes how to build a
109  protein.  The description is a sequence of 3-character codons.  Each  protein.  The description is a sequence of 3-character codons.  Each
110  codon corresponds to either a single amino acid or a stop codon.  codon corresponds to either a single amino acid or a stop codon.
111  There are three stop codons: {TAA,TAG,TGA}.  The genetic code is the  There are three stop codons: {TAA,TAG,TGA}.  The genetic code is the
112  table of correspondences between codons and amino acids:  table of correspondences between codons and amino acids:
113  <br><br>  <br>
114  <table border>  <br>
115  <tr><th>Amino Acid</th><th>Codons</th></tr>  <table border="1">
116  <tr><td>A</td> <td>GCT, GCC, GCA, GCG </td></tr>  <tbody>
117  <tr><td>C</td> <td>TGT, TGC</td></tr>  <tr>
118  <tr><td>D</td> <td>GAT, GAC</td></tr>  <th>Amino Acid</th>
119  <tr><td>E</td> <td>GAA, GAG</td></tr>  <th>Codons</th>
120  <tr><td>F</td> <td>TTT, TTC</td></tr>  </tr>
121  <tr><td>G</td> <td>GGT, GGC, GGA, GGG</td></tr>  <tr>
122  <tr><td>H</td> <td>CAT, CAC</td></tr>  <td>A</td>
123  <tr><td>I</td> <td>ATT, ATC, ATA</td></tr>  <td>GCT, GCC, GCA, GCG </td>
124  <tr><td>K</td> <td>AAA, AAG</td></tr>  </tr>
125  <tr><td>L</td> <td>TTA, TTG, CTT, CTC, CTA, CTG</td></tr>  <tr>
126  <tr><td>M</td> <td>ATG</td></tr>  <td>C</td>
127  <tr><td>N</td> <td>AAT, AAC</td></tr>  <td>TGT, TGC</td>
128  <tr><td>P</td> <td>CCT, CCC, CCA, CCG</td></tr>  </tr>
129  <tr><td>Q</td> <td>CAA, CAG</td></tr>  <tr>
130  <tr><td>R</td> <td>CGT, CGC, CGA, CGG, AGA, AGG</td></tr>  <td>D</td>
131  <tr><td>S</td> <td>TCT, TCC, TCA, TCG, AGT, AGC</td></tr>  <td>GAT, GAC</td>
132  <tr><td>T</td> <td>ACT, ACC, ACA, ACG</td></tr>  </tr>
133  <tr><td>V</td> <td>GTT, GTC, GTA, GTG</td></tr>  <tr>
134  <tr><td>W</td> <td>TGG</td></tr>  <td>E</td>
135  <tr><td>Y</td> <td>TAT, TAC</td></tr>  <td>GAA, GAG</td>
136  <tr><td>*</td> <td>TAG, TGA, TAA  [Stop codons]</td></tr>  </tr>
137    <tr>
138    <td>F</td>
139    <td>TTT, TTC</td>
140    </tr>
141    <tr>
142    <td>G</td>
143    <td>GGT, GGC, GGA, GGG</td>
144    </tr>
145    <tr>
146    <td>H</td>
147    <td>CAT, CAC</td>
148    </tr>
149    <tr>
150    <td>I</td>
151    <td>ATT, ATC, ATA</td>
152    </tr>
153    <tr>
154    <td>K</td>
155    <td>AAA, AAG</td>
156    </tr>
157    <tr>
158    <td>L</td>
159    <td>TTA, TTG, CTT, CTC, CTA, CTG</td>
160    </tr>
161    <tr>
162    <td>M</td>
163    <td>ATG</td>
164    </tr>
165    <tr>
166    <td>N</td>
167    <td>AAT, AAC</td>
168    </tr>
169    <tr>
170    <td>P</td>
171    <td>CCT, CCC, CCA, CCG</td>
172    </tr>
173    <tr>
174    <td>Q</td>
175    <td>CAA, CAG</td>
176    </tr>
177    <tr>
178    <td>R</td>
179    <td>CGT, CGC, CGA, CGG, AGA, AGG</td>
180    </tr>
181    <tr>
182    <td>S</td>
183    <td>TCT, TCC, TCA, TCG, AGT, AGC</td>
184    </tr>
185    <tr>
186    <td>T</td>
187    <td>ACT, ACC, ACA, ACG</td>
188    </tr>
189    <tr>
190    <td>V</td>
191    <td>GTT, GTC, GTA, GTG</td>
192    </tr>
193    <tr>
194    <td>W</td>
195    <td>TGG</td>
196    </tr>
197    <tr>
198    <td>Y</td>
199    <td>TAT, TAC</td>
200    </tr>
201    <tr>
202    <td>*</td>
203    <td>TAG, TGA, TAA [Stop codons]</td>
204    </tr>
205    </tbody>
206  </table>  </table>
207  <br><br>  <br>
208  <hr>  <br>
209  The process of building a protein as a string of amino acids from the gene containing codons is  </p>
210    <hr>The process of building a protein as a string of amino acids
211    from the gene containing codons is
212  called <b>expressing</b> the gene.  called <b>expressing</b> the gene.
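The genetic code table and the definition of expressing a gene can be sketched in a few lines of Python. This is only an illustration under the basic abstraction: the names `CODON_TABLE_ROWS`, `CODON_TO_AMINO`, and `express` are ours, and the sketch assumes the gene is read from its first character, on one strand, and ends at the first stop codon.

```python
# Genetic code exactly as listed in the table above; "*" marks the stop codons.
CODON_TABLE_ROWS = {
    "A": "GCT GCC GCA GCG",
    "C": "TGT TGC",
    "D": "GAT GAC",
    "E": "GAA GAG",
    "F": "TTT TTC",
    "G": "GGT GGC GGA GGG",
    "H": "CAT CAC",
    "I": "ATT ATC ATA",
    "K": "AAA AAG",
    "L": "TTA TTG CTT CTC CTA CTG",
    "M": "ATG",
    "N": "AAT AAC",
    "P": "CCT CCC CCA CCG",
    "Q": "CAA CAG",
    "R": "CGT CGC CGA CGG AGA AGG",
    "S": "TCT TCC TCA TCG AGT AGC",
    "T": "ACT ACC ACA ACG",
    "V": "GTT GTC GTA GTG",
    "W": "TGG",
    "Y": "TAT TAC",
    "*": "TAG TGA TAA",
}

# Invert the table: codon -> one-letter amino acid code (or "*").
CODON_TO_AMINO = {
    codon: amino
    for amino, codons in CODON_TABLE_ROWS.items()
    for codon in codons.split()
}


def express(gene):
    """Translate a gene codon by codon, stopping at the first stop codon."""
    protein = []
    usable = len(gene) - len(gene) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        amino = CODON_TO_AMINO[gene[i:i + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


print(express("ATGGCTGATTGA"))  # ATG GCT GAT TGA
```

Inverting the table once makes each codon lookup a single dictionary access, which is why the sketch stores the table in the same amino-acid-first layout as the text and derives the codon-first map from it.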
213  <br>  <br>
214  A <b>subsystem</b> (i.e., an abstract cellular machine) is a set of functional roles.  A <b>subsystem</b> (i.e., an abstract cellular machine) is
215  Each protein implements one or more functional roles.  The set of functional roles  a set of functional roles.
216  implemented by the protein is called the <b>function of the protein</b>.  The function of a  multifunctional  Each protein implements one or more functional roles. The set of
217  protein that implements {functional-role-1,functional-role-2} is normally written as  functional roles
218    implemented by the protein is called the <b>function of the
219    protein</b>. The function of a multifunctional
220    protein that implements {functional-role-1,functional-role-2} is
221    normally written as
222  <i>functional-role-1 / functional-role-2</i>.  <i>functional-role-1 / functional-role-2</i>.
223  <br><br>  <br>
224  A <b>populated subsystem</b> is a subsystem with an attached spreadsheet.  Each column  <br>
225  in the spreadsheet corresponds to a functional role in the subsystem, and each row corresponds to  A <b>populated subsystem</b> is a subsystem with an
226  a specific genome.  Each cell in the spreadsheet contains the genes from the corresponding genome  attached spreadsheet. Each column
227  that implement the designated functional role (there may be 0 or more such genes).  in the spreadsheet corresponds to a functional role in the subsystem,
228  <br><br>  and each row corresponds to
229  We do not actually know what machines are present in a cell.  We are in the midst of a grand  a specific genome. Each cell in the spreadsheet contains the genes from
230  effort to clarify which are there and what they do.  The formulation of subsystems as abstract machines  the corresponding genome
231  in which each row of the subsystem describes a specific cellular machine that is believed to be present,  that implement the designated functional role (there may be 0 or more
232    such genes).
233    <br>
234    <br>
235    We do not actually know what machines are present in a cell. We are in
236    the midst of a grand
237    effort to clarify which are there and what they do. The formulation of
238    subsystems as abstract machines
239    in which each row of the subsystem describes a specific cellular
240    machine that is believed to be present,
241  represents a way to maintain a collection of estimates or assertions.  represents a way to maintain a collection of estimates or assertions.
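One way to picture a populated subsystem is as a small data structure. The sketch below is an assumption-laden illustration, not the representation used by any particular system: the class name `PopulatedSubsystem`, its methods, and the genome, role, and gene identifiers in the usage lines are all hypothetical.

```python
class PopulatedSubsystem:
    """A subsystem (a set of functional roles) plus its spreadsheet.

    Columns are functional roles, rows are genomes, and each cell holds
    the set of genes (possibly empty) implementing that role in that genome.
    """

    def __init__(self, roles):
        self.roles = list(roles)   # column order of the spreadsheet
        self.cells = {}            # (genome, role) -> set of gene ids

    def assign(self, genome, role, gene):
        """Assert that `gene` implements `role` in `genome`."""
        if role not in self.roles:
            raise KeyError(f"unknown functional role: {role}")
        self.cells.setdefault((genome, role), set()).add(gene)

    def cell(self, genome, role):
        """Return the genes in one spreadsheet cell (0 or more)."""
        return set(self.cells.get((genome, role), set()))

    def missing_roles(self, genome):
        """Roles with no gene in this genome's row."""
        return [role for role in self.roles if not self.cell(genome, role)]


# Hypothetical identifiers, for illustration only.
subsystem = PopulatedSubsystem(["role-1", "role-2"])
subsystem.assign("genome-A", "role-1", "gene-17")
print(subsystem.cell("genome-A", "role-1"))
print(subsystem.missing_roles("genome-A"))
```

An empty cell is as informative as a filled one: a row with missing roles is an estimate that the corresponding machine is absent or incomplete in that genome, or that the gene has not yet been found.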
242  <p>  <p>A <b>protein family</b> is defined to be a set of
243  A <b>protein family</b> is defined to be a set of proteins that implement the same functional roles and  proteins that implement the same functional roles and
244  are similar over the entire lengths of the proteins.  are similar over the entire lengths of the proteins.
245  <p>  </p>
246  We seek a situation in which each protein occurs in one or more subsystems and in a single protein family.  <p>We seek a situation in which each protein occurs in one or
247  <p>  more subsystems and in a single protein family.
248  In any specific cell, sets of specific cellular machines are  </p>
249  switched on and off as units.  That is, they are <i>co-regulated</i>.  We will call such a set  <p>In any specific cell, sets of specific cellular machines are
250  of <i>co-regulated cellular machines</i> a <b>regulon</b> (note that a regulon is often a set containing  switched on and off as units. That is, they are <i>co-regulated</i>.
251  a single cellular machine).  A <b>state</b> of a cell will be defined  We will call such a set
252  as the set of regulons that are operational at a point in time.  Thus, a state amounts to the set  of <i>co-regulated cellular machines</i> a <b>regulon</b>
253    (note that a regulon is often a set containing
254    a single cellular machine). A <b>state</b> of a cell will
255    be defined
256    as the set of regulons that are operational at a point in time. Thus, a
257    state amounts to the set
258  of cellular machines that are operational at one instant.  of cellular machines that are operational at one instant.
259  <p>  </p>
260  Microarrays are, for a given genome, two lists of genes that "changed expression levels" between two states of a  <p>Microarrays are, for a given genome, two lists of genes that
261  cell.  Basically, the first list contains genes that were "active" during the first state, but not the second; and the  "changed expression levels" between two states of a
262  second list contains genes that were "active" in the second but not the first.  If a cellular  cell. Basically, the first list contains genes that were "active" during
263  machine utilizes protein <i>X</i>, and <i>X</i> is in the first list, and if <i>X</i> is used in  the first state, but not the second; and the
264  only one cellular machine, then it would be reasonable to infer that the machine was  second list contains genes that were "active" in the second but not the
265    first. If a cellular
266    machine utilizes protein <i>X</i>, and <i>X</i>
267    is in the first list, and if <i>X</i> is used in
268    only one cellular machine, then it would be reasonable to infer that
269    the machine was
270  active in the first state, but not the second.  active in the first state, but not the second.
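The inference just described can be written down directly. This is a minimal sketch under two assumptions of ours: each gene is identified with the protein it encodes, and we are given a table `machines` mapping each machine name to the set of proteins it uses. The function name and the example identifiers are hypothetical.

```python
def machines_active_only_in_first(first_list, machines):
    """Infer machines active in the first state but not the second.

    first_list: proteins whose genes were "active" only in the first state.
    machines:   dict mapping machine name -> set of proteins it uses.
    A machine is inferred only from a protein that is used by no other machine.
    """
    # Index: protein -> set of machines that use it.
    users = {}
    for machine, proteins in machines.items():
        for protein in proteins:
            users.setdefault(protein, set()).add(machine)

    inferred = set()
    for protein in first_list:
        using = users.get(protein, set())
        if len(using) == 1:  # the protein is unique to one machine
            inferred |= using
    return inferred


machines = {"machine-1": {"X", "Y"}, "machine-2": {"Y", "Z"}}
print(machines_active_only_in_first(["X"], machines))  # X is unique to machine-1
print(machines_active_only_in_first(["Y"], machines))  # Y is shared, so no inference
```

The uniqueness test matters: a protein shared by two machines says nothing about which of them was switched on.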
271    </p>
272  <h2>The cell: the Enhanced Formulation Needed to Support Bioinformatics</h2>  <h2>The cell: the Enhanced Formulation Needed to Support
273    Bioinformatics</h2>
274  In the enhanced abstraction, we need to loosen up some concepts.  In particular,  In the enhanced abstraction, we need to loosen up some concepts. In
275    particular,
276  <ul>  <ul>
277  <li> A <b>genome</b> is a set of strings in a 4-character alphabet.  Each of the strings  <li> A <b>genome</b> is a set of strings in a
278  is called a <b>contig</b>.  Note that the concept as formulated covers both incomplete genomes and  4-character alphabet. Each of the strings
279  genomes with multiple replicons.  is called a <b>contig</b>. Note that the concept as
280    formulated covers both incomplete genomes and genomes with multiple
281    replicons.
282    </li>
283  <li>The genes within a genome are of two distinct types:  <li>The genes within a genome are of two distinct types:
284  <ol>  <ol>
285  <li>those that describe how to construct a protein (i.e., protein-encoding genes), and  <li>those that describe how to construct a protein (i.e.,
286  <li>those that describe how to construct a string of RNA (i.e., how to construct a string in the  protein-encoding genes), and
287    </li>
288    <li>those that describe how to construct a string of RNA
289    (i.e., how to construct a string in the
290  4-character RNA alphabet {A,C,G,U}).  4-character RNA alphabet {A,C,G,U}).
291    </li>
292  </ol>  </ol>
293  <br><br>  <br>
294  <li>The location of a gene is generalized to be a set of regions within the genome (that are  <br>
295  concatenated to form the instructions needed to construct either a protein or a string of RNA).  </li>
296  <li>A protein is a string of characters in an alphabet that now includes the 20 character codes from  <li>The location of a gene is generalized to be a set of
297  the basic abstraction plus a very limited set of extra codes.  regions within the genome (that are
298  We already have cases in which <i>selenocysteine</i>  and <i>pyrrolysine</i> appear as nonstandard  concatenated to form the instructions needed to construct either a
299  translations of codons, and there may eventually be more.  protein or a string of RNA).
300    </li>
301  <li>Each protein-encoding gene has both a DNA sequence (by definition) and a translation.  However,  <li>A protein is a string of characters in an alphabet that now includes
302  the translation is not required to exactly match what a codon-by-codon translation of the DNA sequence  the 20 character codes from
303  would produce.  This allows us to handle the very rare instances in which selenocysteine occurs as the translation  the basic abstraction plus a very limited set of extra codes. We
304  of TGA or pyrrolysine occurs as a translation of TAG (and others, if necessary).  already have cases in which <i>selenocysteine</i> and <i>pyrrolysine</i>
305    appear as nonstandard translations of codons, and there may eventually
306    be more.
307    </li>
308    <li>Each protein-encoding gene has both a DNA sequence (by
309    definition) and a translation. However,
310    the translation is not required to exactly match what a codon-by-codon
311    translation of the DNA sequence
312    would produce. This allows us to handle the very rare instances in
313    which selenocysteine occurs as the translation
314    of TGA or pyrrolysine occurs as a translation of TAG (and others, if
315    necessary).
316    </li>
317  </ul>  </ul>
318    This loosened up formulation represents a very minimal set of changes.
319  This loosened up formulation represents a very minimal set of changes.  They should be left out of the  They should be left out of the
320  basic tutorial for computer scientists and mathematicians.  basic tutorial for computer scientists and mathematicians.
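The enhanced notions of genome and gene location can be sketched as follows. The coordinate convention is our assumption, since the text does not fix one: a genome is a dictionary from contig name to DNA string, and a location is an ordered list of `(contig, start, end)` regions using 0-based, half-open coordinates on the forward strand only. The function name `gene_dna` and the contig names are hypothetical.

```python
def gene_dna(genome, location):
    """Concatenate the regions of a gene's location into one DNA string.

    genome:   dict mapping contig name -> DNA string (a set of contigs).
    location: ordered list of (contig, start, end) regions, 0-based and
              half-open, forward strand only (an assumption of this sketch).
    """
    return "".join(genome[contig][start:end] for contig, start, end in location)


# A toy two-contig genome with a gene split across two regions.
genome = {"contig1": "AAATGGCTTT", "contig2": "CCGATTGACC"}
location = [("contig1", 2, 8), ("contig2", 2, 8)]
print(gene_dna(genome, location))
```

Because a gene also carries its own stored translation in the enhanced abstraction, a program would normally read that translation rather than recompute it from this string, which is what allows the rare nonstandard codon readings to be recorded.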
321    <h2>The cell: Adding the Concepts Needed to Discuss
322  <h2>The cell: Adding the Concepts Needed to Discuss Transcriptional Regulation</h2>  Transcriptional Regulation</h2>
323    In the final version of the abstraction, we add the minimal set of
324  In the final version of the abstraction, we add the minimal set of notions needed to support  notions needed to support
325  analysis of transcriptional regulation.  An <b>operon</b> is a set of contiguous genes that are all  analysis of transcriptional regulation. An <b>operon</b>
326  on the same strand and are all co-regulated.  We consider a gene that is not co-regulated with any adjacent genes  is a set of contiguous genes that are all on the same strand and are
327  to be an operon composed of just itself.  A <b>binding site</b> is a small region of DNA (normally  all co-regulated. We consider a gene that is not co-regulated with any
328  occurring a short space ahead of an operon) that acts as a switch turning the operon "on" or "off". When  adjacent genes
329  a specific protein or expressed RNA called a <b>transcriptional regulator</b> binds the site, it flips the switch.  One or more  to be an operon composed of just itself. A <b>binding site</b>
330  specific transcriptional regulators can bind a specific site (i.e., sets of  is a small region of DNA (normally
331   sites are associated with each specific transcriptional regulator).  Binding of a regulator at a site  occurring a short space ahead of an operon) that acts as a switch
332  always has the same effect (either activating or deactivating the operon), but which effect depends on  turning the operon "on" or "off". When
333    a specific protein or expressed RNA called a <b>transcriptional
334    regulator</b> binds the site, it flips the switch. One or more
335    specific transcriptional regulators can bind a specific site (i.e.,
336    sets of sites are associated with each specific transcriptional
337    regulator). Binding of a regulator at a site
338    always has the same effect (either activating or deactivating the
339    operon), but which effect depends on
340  the site-regulator pair.  the site-regulator pair.
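The relationships among operons, binding sites, and transcriptional regulators can be captured in a small model. The sketch below is illustrative only: the class `RegulatoryModel`, its method names, and the identifiers in the usage lines are hypothetical, and it assumes each binding site controls exactly one operon.

```python
ACTIVATE = "activate"
DEACTIVATE = "deactivate"


class RegulatoryModel:
    """Binding sites act as switches on operons; the effect of a binding
    event is fixed per (site, regulator) pair."""

    def __init__(self):
        self.site_to_operon = {}   # site -> operon it controls
        self.effects = {}          # (site, regulator) -> ACTIVATE or DEACTIVATE

    def add_site(self, site, operon):
        self.site_to_operon[site] = operon

    def add_binding(self, site, regulator, effect):
        if site not in self.site_to_operon:
            raise KeyError(f"unknown binding site: {site}")
        if effect not in (ACTIVATE, DEACTIVATE):
            raise ValueError(f"unknown effect: {effect}")
        self.effects[(site, regulator)] = effect

    def sites_for(self, regulator):
        """The set of sites associated with one regulator."""
        return {site for (site, reg) in self.effects if reg == regulator}

    def operons_switched(self, regulator):
        """Sorted (operon, effect) pairs produced when the regulator binds."""
        return sorted(
            (self.site_to_operon[site], effect)
            for (site, reg), effect in self.effects.items()
            if reg == regulator
        )


model = RegulatoryModel()
model.add_site("site-1", "operon-A")
model.add_site("site-2", "operon-B")
model.add_binding("site-1", "regulator-R", ACTIVATE)
model.add_binding("site-2", "regulator-R", DEACTIVATE)
print(model.operons_switched("regulator-R"))
```

Keying the effect on the pair, rather than on the regulator alone, lets one regulator activate some operons while deactivating others, as the definition above permits.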
   
341  <h1>Part 2: Tutorial Notes</h1>  <h1>Part 2: Tutorial Notes</h1>
   
342  <h2>Notes for The Basic Abstraction</h2>  <h2>Notes for The Basic Abstraction</h2>
343    We will be speaking about organisms that are a single cell. At some
344  We will be speaking about organisms that are a single cell.  At some point life began on earth.  point life began on earth.
345  The single-celled organisms that we know of replicate producing copies of themselves that have  The single-celled organisms that we know of replicate producing copies
346  genomes which usually have very, very similar content to that of the parent cell.  <b>Evolution</b> is the  of themselves that have
347  process in which cells replicate with some alterations in their genomes, are subjected to  genomes which usually have very, very similar content to that of the
348  <i>selective pressure</i>, and survive or not depending on many somewhat random factors.  The makeup of  parent cell. <b>Evolution</b> is the
349  cells (i.e., the genomes they contain and the machines that define what they are capable of doing)  process in which cells replicate with some alterations in their
350    genomes, are subjected to
351    <i>selective pressure</i>, and survive or not depending on
352    many somewhat random factors. The makeup of
353    cells (i.e., the genomes they contain and the machines that define what
354    they are capable of doing)
355  changes gradually (and sometimes not so gradually) as time passes.  changes gradually (and sometimes not so gradually) as time passes.
356  <p>  <p>The original life forms that existed billions of years ago
357  The original life forms that existed billions of years ago have evolved into three broad categories of  have evolved into three broad categories of
358  life forms.  That is, the evolutionary process led to early divisions, and these led to three main  life forms. That is, the evolutionary process led to early divisions,
359    and these led to three main
360  categories of single-celled organisms.  We call these three forms the <b>archaea</b>,  categories of single-celled organisms.  We call these three forms the <b>archaea</b>,
361  the <b>bacteria</b>, and the <b>eukaryotes</b>.  the <b>bacteria</b>, and the <b>eukaryotes</b>.
362  A majority of the organisms for which we have acquired complete genomes are from the bacteria,  A majority of the organisms for which we have acquired complete genomes
363  although the  are from the bacteria, although the
364  numbers are rapidly growing for all three domains.  numbers are rapidly growing for all three domains.
365  <p>  </p>
366  The minimal notion of a cell is enough to explain some of the basic  <p>The minimal notion of a cell is enough to explain some of the
367    basic
368  problems in bioinformatics:  problems in bioinformatics:
369    </p>
370  <h3>Identify the genes within a genome</h3>  <h3>Identify the genes within a genome</h3>
   
371  If we are to understand the contents of genomes, we will need to  If we are to understand the contents of genomes, we will need to
372  locate the genes that occur in each genome.  This problem simply involves taking a genome (a  locate the genes that occur in each genome. This problem simply
373  string of DNA) and locating the set of genes it contains.  involves taking a genome (a
374  In the case of bacteria and archaea, we know pretty well how to  string of DNA) and locating the set of genes it contains. In the case
375  locate the genes.  of bacteria and archaea, we know pretty well how to
376  Once we  locate the genes. Once we
377  have identified instances from many genomes, it becomes possible to  have identified instances from many genomes, it becomes possible to
378  recognize the genes in a new genome by just looking for things similar  recognize the genes in a new genome by just looking for things similar
379  to those we already understand.  The following problem is at the heart of recognizing when two  to those we already understand. The following problem is at the heart
380    of recognizing when two
381  genes are "similar".  genes are "similar".
382    <h3>Given two genes, "align" them in a way that minimizes some
383  <h3>Given two genes, "align" them in a way that minimizes some edit function.  </h3>  edit function. </h3>
384    For example, here is what you see when you align two genes from
385  For example, here is what you see when you align two genes from distinct organisms:  distinct organisms:
386    <pre>gene1 ATGGCTGATTTATTCGCATTGACCGAAGAAGCGTTGGCGGGCATGGGCATCGAGTTGGTC<br>gene2 ---GTGCAACTGACGGAACTGATAGAAACTACGGTCACGGGGCTCGGCTACGAGCTCGTC<br>* * * * * *** *** ** * **** * *** **** * ***<br>gene1 GATGTCGAACGTGCCGCCTTAGGCTTGTTGCGCGTGACCATAGACCGTGAGGACGGTGTT<br>gene2 GATCTCGAGCGCACCGGGCGCGGCATGGTCTGCGTCTACATCGATCAGCCCGCCGGCATC<br>*** **** ** *** *** ** * **** *** ** * * *** * gene1 CGCATCGAAGATTGTGAGCAGGTGTCCCGGCAATTGTCGCGCGTCTACGAGGTCGAGAAC<br>gene2 ACGATCGACGATTGCGAGAAGGTCACGCGTCAGCTCCAGCACGTACTGACGGTCGAAAAC<br>***** ***** *** **** * ** ** * ** *** ****** ***<br>gene1 ATCGATTACAAACGTCTGGAAGTTGGCTCGCCGGGCGTGGATCGCCCCTTGCGCAACGAG<br>gene2 ATCGATTACGAACGGCTCGAGGTCTCGTCACCGGGGCTCGACCGGCCGTTGAAGAAGCTG<br>********* **** ** ** ** ** ***** * ** ** ** *** ** *<br>gene1 GCGGAATTCCGTCGTTTCGCGGGTGAACGTATCGAGATCAAGCTGCGTGAGGCAGTCGAC<br>gene2 GCTGACTTCACGCGTTTCGCGGGCAGCGAGGCCGTCATCACCCTGAAAAAGCCGTTGGAC<br>** ** *** *********** ** **** *** ** * * ***<br>gene1 GGGCGCAAAGTGTTTACCGGCATCCTGCAAGAGGCGGACACGTCTGCTGACGATAAGACG<br>gene2 GGGCGCAAGACGTACCGGGGCATTCTGCACGCGCCGAAC------------GGCGAGACG<br>******** ** ***** ***** * * ** ** * *****<br>gene1 GTGTTCGGTCTCGAATTTGAGGCAAAGAAGGACGATATTCAGGTACTGAGCTTCACGCTC<br>gene2 AT---CGGTTTGGAATTTGAGAGGAAGAAGGGCGAGGCGGCCATGCTGGATTTCACGCTG<br>* **** * ********* ******* *** * *** ******** gene1 GATGACATCGAGCGCGCCAAGCTGGATCCCGTTCTGGATTTCAAGGGCAAAAAGCGATGA<br>gene2 GCGGACATCGACAAGGCCCGCCTGATTCCGCACGTTGACTTTAGGAGCCGCAAACAATGA<br>* ******** *** *** *** * ** ** * * ** ** * ****<br></pre>
 <pre>  
   
   
 gene1           ATGGCTGATTTATTCGCATTGACCGAAGAAGCGTTGGCGGGCATGGGCATCGAGTTGGTC  
 gene2           ---GTGCAACTGACGGAACTGATAGAAACTACGGTCACGGGGCTCGGCTACGAGCTCGTC  
                    *   *  *    * * ***  ***    ** *  ****  * ***  **** * ***  
   
 gene1           GATGTCGAACGTGCCGCCTTAGGCTTGTTGCGCGTGACCATAGACCGTGAGGACGGTGTT  
 gene2           GATCTCGAGCGCACCGGGCGCGGCATGGTCTGCGTCTACATCGATCAGCCCGCCGGCATC  
                 *** **** **  ***     *** ** *  ****   *** ** *     * ***  *  
   
 gene1           CGCATCGAAGATTGTGAGCAGGTGTCCCGGCAATTGTCGCGCGTCTACGAGGTCGAGAAC  
 gene2           ACGATCGACGATTGCGAGAAGGTCACGCGTCAGCTCCAGCACGTACTGACGGTCGAAAAC  
                    ***** ***** *** ****  * ** **  *   ** ***      ****** ***  
   
 gene1           ATCGATTACAAACGTCTGGAAGTTGGCTCGCCGGGCGTGGATCGCCCCTTGCGCAACGAG  
 gene2           ATCGATTACGAACGGCTCGAGGTCTCGTCACCGGGGCTCGACCGGCCGTTGAAGAAGCTG  
                 ********* **** ** ** **    ** *****  * ** ** ** ***   **   *  
   
 gene1           GCGGAATTCCGTCGTTTCGCGGGTGAACGTATCGAGATCAAGCTGCGTGAGGCAGTCGAC  
 gene2           GCTGACTTCACGCGTTTCGCGGGCAGCGAGGCCGTCATCACCCTGAAAAAGCCGTTGGAC  
                 ** ** ***   ***********         **  ****  ***    ** *  * ***  
   
 gene1           GGGCGCAAAGTGTTTACCGGCATCCTGCAAGAGGCGGACACGTCTGCTGACGATAAGACG  
 gene2           GGGCGCAAGACGTACCGGGGCATTCTGCACGCGCCGAAC------------GGCGAGACG  
                 ********   **     ***** ***** * * ** **            *   *****  
   
 gene1           GTGTTCGGTCTCGAATTTGAGGCAAAGAAGGACGATATTCAGGTACTGAGCTTCACGCTC  
 gene2           AT---CGGTTTGGAATTTGAGAGGAAGAAGGGCGAGGCGGCCATGCTGGATTTCACGCTG  
                  *   **** * *********   ******* ***        * ***   ********  
   
 gene1           GATGACATCGAGCGCGCCAAGCTGGATCCCGTTCTGGATTTCAAGGGCAAAAAGCGATGA  
 gene2           GCGGACATCGACAAGGCCCGCCTGATTCCGCACGTTGACTTTAGGAGCCGCAAACAATGA  
                 *  ********    ***   ***  ***     * ** ** * * **   ** * ****  
 </pre>  
387  <hr>  <hr>
388    The sequences are recognizably similar, and in fact implement exactly
389  The sequences are recognizably similar, and in fact implement exactly the same function  the same function
390  in the two cells.  If we align the protein sequences corresponding to these two  in the two cells. If we align the protein sequences corresponding to
391    these two
392  genes, we get  genes, we get
393    <pre>gene1 MADLFALTEEALAGMGIELVDVERAALGLLRVTIDREDGVRIEDCEQVSRQLSRVYEVEN<br>gene2 -VQLTELIETTVTGLGYELVDLERTGRGMVCVYIDQPAGITIDDCEKVTRQLQHVLTVEN<br> :* * * :::*:* ****:**:. *:: * **: *: *:***:*:***.:* ***<br><br>gene1 IDYKRLEVGSPGVDRPLRNEAEFRRFAGERIEIKLREAVDGRKVFTGILQEADTSADDKT<br>gene2 IDYERLEVSSPGLDRPLKKLADFTRFAGSEAVITLKKPLDGRKTYRGILHAPNG-----E<br> ***:****.***:****:: *:* ****.. *.*::.:****.: ***: .:<br><br>gene1 VFGLEFEAKKDDIQVLSFTLDDIERAKLDPVLDFKGKKR<br>gene2 TIGLEFERKKGEAAMLDFTLADIDKARLIPHVDFRSRKQ<br> .:***** **.: :*.*** **::*:* * :**:.:*:<br></pre>
394  <pre>  There is a great deal of work relating to recognizing when two
395  gene1           MADLFALTEEALAGMGIELVDVERAALGLLRVTIDREDGVRIEDCEQVSRQLSRVYEVEN  sequences are
396  gene2           -VQLTELIETTVTGLGYELVDLERTGRGMVCVYIDQPAGITIDDCEKVTRQLQHVLTVEN  similar and whether or not they had a common ancestor. Understanding
397                    :*  * * :::*:* ****:**:. *:: * **:  *: *:***:*:***.:*  ***  why
398    selective pressure conserves sections of sequences, but not others,
399  gene1           IDYKRLEVGSPGVDRPLRNEAEFRRFAGERIEIKLREAVDGRKVFTGILQEADTSADDKT  will yield
400  gene2           IDYERLEVSSPGLDRPLKKLADFTRFAGSEAVITLKKPLDGRKTYRGILHAPNG-----E  important clues. Can you reason out why some sections might be
401                  ***:****.***:****:: *:* ****..  *.*::.:****.: ***: .:  conserved, while
   
 gene1           VFGLEFEAKKDDIQVLSFTLDDIERAKLDPVLDFKGKKR  
 gene2           TIGLEFERKKGEAAMLDFTLADIDKARLIPHVDFRSRKQ  
                 .:***** **.:  :*.*** **::*:* * :**:.:*:  
 </pre>  
   
 There is a great deal of work relating to recognizing when two sequences are  
 similar and whether or not they had a common ancestor.  Understanding why  
 selective pressure conserves sections of sequences, but not others, will yield  
 important clues.  Can you reason out why some sections might be conserved, while  
402  others vary wildly?  others vary wildly?
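The step from a gene to its protein sequence can be sketched in a few lines. This is a minimal illustration, assuming the standard genetic code and a gene that is read in frame from its first character; the names `CODON_TABLE` and `translate` are hypothetical and not part of the original document.

```python
# Standard genetic code, packed in TCAG order (an assumption: the standard
# table is used; some organisms use variant codes).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA string codon by codon, stopping at the first stop codon."""
    protein = []
    usable_length = len(dna) - len(dna) % 3   # ignore a trailing partial codon
    for pos in range(0, usable_length, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3].upper()]
        if amino_acid == "*":                 # "*" marks a stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# First 60 characters of gene1 from the DNA alignment above.
gene1_start = "ATGGCTGATTTATTCGCATTGACCGAAGAAGCGTTGGCGGGCATGGGCATCGAGTTGGTC"
print(translate(gene1_start))
```

Applied to the first line of gene1, the sketch reproduces the first twenty characters of the gene1 protein shown in the protein alignment.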
403  <p>  <p>Comparing sets of sequences that have retained the same
404    function is
405  Comparing sets of sequences that have retained the same function is  at the heart of understanding cellular machines and the proteins that
406  at the heart of understanding cellular machines and the proteins that implement them.  implement them. We find that looking at sets (often with more than two
407  We find that looking at sets (often with more than two sequences) and aligning them  sequences) and aligning them
408  is important.  is important.
409    </p>
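For just two sequences, one simple choice of edit function is unit-cost edit distance, where each insertion, deletion, or substitution costs 1. The sketch below is only an illustration of that idea, not the scoring used by any particular alignment tool; the function name `edit_distance` is hypothetical.

```python
def edit_distance(a, b):
    """Unit-cost edit distance between two sequences (dynamic programming)."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]                          # distance from a[:i] to the empty prefix
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1         # drop char_a
            insertion = current[j - 1] + 1     # insert char_b
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(edit_distance("ATGGCTGAT", "GTGCAACTG"))
```

Real alignment programs typically use richer edit functions (for example, substitution costs that depend on the pair of characters and separate penalties for opening and extending gaps), but the dynamic-programming structure is similar.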
410    <h3> Given a set of sequences, align them in a way that minimizes
411  <h3> Given a set of sequences, align them in a way that minimizes some edit function.</h3>  some edit function.</h3>
   
412  Here is an example of a multiple sequence alignment:  Here is an example of a multiple sequence alignment:
413  <br>  <br>
414  <br>  <br>
415  <pre>  <pre>CLUSTAL W (1.83) multiple sequence alignment<br><br><br>seq3 -------------------MRYISTRGQAPALNFEDVLLAGLASDGGLYVPENLPRFTLE<br>seq4 -------------------MRYISTRGSAPTLSFEEVLLTGLASDGGLYVPESLPSFTSA<br>seq5 -------------------MNYISTRGAIAPIGFKDAVMMGLATDGGLLLPETIPALGRN<br>seq1 -------------------MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLT<br>seq2 MKIRVICGAPTPKPFIKIPMKYYSTNKQAPLASLEEAVVKGLASDKGLFMPMTIKPLPQE<br> *. . . .: :.: **..: ** .* : :<br><br>seq3 EIASWVGLPYHELAFRVMRPFVAGSIADADFKKILEETYGVFAHDAVAPLRQLNGNEWVL<br>seq4 ELEAMASLDYPSLAHRILLPFVEEAFTGEELREIIDDTYAVFRHSAVAPLVQLDHNQWVL<br>seq5 TLESWQSLSYQDLAFNVIS-LFADDIPAQDLKDLIDRSYATFSHPEITPVVEKDG-VYIL<br>seq1 EIDEMLKLDFVTRSAKILSAFIGDEIPQEILEERVRAAFAFP-----APVANVESDVGCL<br>seq2 FYDEIENLSFREIAYRVADAFFGEDVPAETLKEIVYDTLNFD-----VPLVPVKENIYSL<br> * : : .: :. .. :.. : : .*: . *<br><br>seq3 ELFHGPTLAFKDFALQLLGRLLDHVLAKRGER-VVIMGATSGDTGSAAIEGCRRCDNVDI<br>seq4 ELFHGPTLAFKDFALQLLGRLLDAILKRRGEK-VVIMGATSGDTGSAAIAGCERCENIDI<br>seq5 ELFHGPTLAFKDVALQLLGNLFEYLLKERGEK-MNIVGATSGDTGSAAIYGVRGKDKINI<br>seq1 ELFHGPTLAFKDFGGRFMAQMLTHIA---GDKPVTILTATSGDTGAAVAHAFYGLPNVKV<br>seq2 ELFHGPTLAFKDVGGRFMARLLGYFIRKEGRKQVNVLVATSGDTGSAVANGFLGVEGIHV<br> ************.. :::..:: . * : : :: *******:*. . 
:.:<br><br>seq3 FIMHPHNRVSEVQRRQMTTILGDNIHNIAIEGNFDDCQEMVKASFADQGFLK-GTRLVAV<br>seq4 FILHPHGRVSEVQRRQMTTLSAPTIHNLAIEGNFDDCQAMVKASFRDQSFLPDGRRLVAV<br>seq5 FILHPHGKTSPVQALQMTTVLDPNVHNIAARGTFDDCQNIVKSLFSDLPFKE-KYSLGAV<br>seq1 VILYPRGKISPLQEKLFCTLGG-NIETVAIDGDFDACQALVKQAFDDEELKV-ALGLNSA<br>seq2 YVLYPKGKVSEIQEKQFTTLGR-NITALEVDGTFDDCQALVKAAFMDQELNE-QLLLTSA<br> :::*:.: * :* : *: .: : * ** ** :** * * : * :.<br><br>seq3 NSINWARIMAQIVYYFHAALQLG-APH-RSVAFSVPTGNFGDIFAGYLARNMGLPVSQLI<br>seq4 NSINWARIMAQIVYYFYAGLRLG-APH-RAAAYSVPTGNFGDIFAGYLASKMGLPVAQLM<br>seq5 NSINWARVLAQVVYYFYAYFRVA-ALFGQEVVFSVPTGNFGDIFAGYVAKRMGLPIRRLI<br>seq1 NSINISRLLAQICYYFEAVAQLPQETRNQ-LVVSVPSGNFGDLTAGLLAKSLGLPVKRFI<br>seq2 NSINVARFLPQAFYYFYAYAQLKKAGRAENVVICVPSGNFGNITAGLFGKKMGLPVRRFI<br> **** :*.:.* *** * :: . . .**:****:: ** .. :***: :::<br><br>seq3 VATNRNDILHRFMSGNRYDKDTLHPSLSPSMDIMVSSNFERLLFDLHGRNGKAVAELLDA<br>seq4 IATNRNDVLHRLLSTGDYARQTLEHTLSPSMDISVSSNFERLMFDLYERDGAAIASLMAA<br>seq5 LATNENNILSRFINGGDYSLGDVVATVSPSMDIQLASNFERYVYYLFGENPARVREAFAA<br>seq1 AATNVNDTVPRFLHDGQWSPKATQATLSNAMDVSQPNNWPR-VEELFR------------<br>seq2 AANNKNDIFYQYLQTGQYNPRPSVATIANAMDVGDPSNFAR-VLDLYGGS----------<br> *.* *: . : : . : ::: :**: ..*: * : *.<br><br>seq3 FKASGKLSVEDQRWTEARKLFDSLAVSDEQTCETIAEVYRSCGELLDPHTAIGVRAAREC<br>seq4 FDD-GDITLSDAAMEKARQLFASHRVDDAQTLACIADVWGRTEYLLDPHSAIGYAAATQP<br>seq5 LPTKGRIDFTEAEMEKVRDEFLSRSVNEDETIATIAAFHRETGYILDPHTAVGVKAALEL<br>seq1 -------------RKIWQLKELGYAAVDDETTQQTMRELKELGYTSEPHAAVAYRALRDQ<br>seq2 -------------HAAIAAEISGTTYTDEQIRESVKACWQQTGYLLDPHGACGYRALEEG<br> . : : :** * . 
* :<br><br>seq3 RRSLSVPMVTLGTAHPVKFPEAVEKAGIGQAPALPAHLADLFEREERCTVLPNELAKVQA<br>seq4 GANTQTPWVTLATAHPAKFPDAIKASAVGTTAQLPVHLADLFERSEHFDVLPNDIAAVQR<br>seq5 VQDG-TPAVCLATAHPAKFAEAVVR-AVGFEPSRPTSLEGIEALPSRCDVLDADRDAIKA<br>seq1 LNPG-EYGLFLGTAHPAKFKESVEA-ILGETLDLPKELAERADLPLLSHNLPADFAALRK<br>seq2 LQPG-ETGVFLETAHPAKFLQTVES-IIGTEVEIPAKLRAFMKGEKKSLPMTKEFADFKS<br> : * ****.** ::: :* * * : : .:<br><br>seq3 FVSQHGNRGKPL<br>seq4 FMSGHLGA----<br>seq5 FIEKKAL-----<br>seq1 LMMNHQ------<br>seq2 YLLGK-------<br> : :<br></pre>
416  CLUSTAL W (1.83) multiple sequence alignment  <h3> Given a multiple sequence alignment, determine the most
417    likely evolutionary history of the sequences (i.e., construct a
418    phylogenetic tree).</h3>
419  seq3            -------------------MRYISTRGQAPALNFEDVLLAGLASDGGLYVPENLPRFTLE  From the extant five sequences that are similar and displayed in the
420  seq4            -------------------MRYISTRGSAPTLSFEEVLLTGLASDGGLYVPESLPSFTSA  previous alignment, we can construct
 seq5            -------------------MNYISTRGAIAPIGFKDAVMMGLATDGGLLLPETIPALGRN  
 seq1            -------------------MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLT  
 seq2            MKIRVICGAPTPKPFIKIPMKYYSTNKQAPLASLEEAVVKGLASDKGLFMPMTIKPLPQE  
                                    *.  . .      .: :.:  **..: ** .*  :  :  
   
 seq3            EIASWVGLPYHELAFRVMRPFVAGSIADADFKKILEETYGVFAHDAVAPLRQLNGNEWVL  
 seq4            ELEAMASLDYPSLAHRILLPFVEEAFTGEELREIIDDTYAVFRHSAVAPLVQLDHNQWVL  
 seq5            TLESWQSLSYQDLAFNVIS-LFADDIPAQDLKDLIDRSYATFSHPEITPVVEKDG-VYIL  
 seq1            EIDEMLKLDFVTRSAKILSAFIGDEIPQEILEERVRAAFAFP-----APVANVESDVGCL  
 seq2            FYDEIENLSFREIAYRVADAFFGEDVPAETLKEIVYDTLNFD-----VPLVPVKENIYSL  
                        * :   : .:   :.   ..   :.. :  :         .*:   .     *  
   
 seq3            ELFHGPTLAFKDFALQLLGRLLDHVLAKRGER-VVIMGATSGDTGSAAIEGCRRCDNVDI  
 seq4            ELFHGPTLAFKDFALQLLGRLLDAILKRRGEK-VVIMGATSGDTGSAAIAGCERCENIDI  
 seq5            ELFHGPTLAFKDVALQLLGNLFEYLLKERGEK-MNIVGATSGDTGSAAIYGVRGKDKINI  
 seq1            ELFHGPTLAFKDFGGRFMAQMLTHIA---GDKPVTILTATSGDTGAAVAHAFYGLPNVKV  
 seq2            ELFHGPTLAFKDVGGRFMARLLGYFIRKEGRKQVNVLVATSGDTGSAVANGFLGVEGIHV  
                 ************.. :::..::  .    * : : :: *******:*.  .      :.:  
   
 seq3            FIMHPHNRVSEVQRRQMTTILGDNIHNIAIEGNFDDCQEMVKASFADQGFLK-GTRLVAV  
 seq4            FILHPHGRVSEVQRRQMTTLSAPTIHNLAIEGNFDDCQAMVKASFRDQSFLPDGRRLVAV  
 seq5            FILHPHGKTSPVQALQMTTVLDPNVHNIAARGTFDDCQNIVKSLFSDLPFKE-KYSLGAV  
 seq1            VILYPRGKISPLQEKLFCTLGG-NIETVAIDGDFDACQALVKQAFDDEELKV-ALGLNSA  
 seq2            YVLYPKGKVSEIQEKQFTTLGR-NITALEVDGTFDDCQALVKAAFMDQELNE-QLLLTSA  
                  :::*:.: * :*   : *:   .:  :   * ** ** :**  * *  :      * :.  
   
 seq3            NSINWARIMAQIVYYFHAALQLG-APH-RSVAFSVPTGNFGDIFAGYLARNMGLPVSQLI  
 seq4            NSINWARIMAQIVYYFYAGLRLG-APH-RAAAYSVPTGNFGDIFAGYLASKMGLPVAQLM  
 seq5            NSINWARVLAQVVYYFYAYFRVA-ALFGQEVVFSVPTGNFGDIFAGYVAKRMGLPIRRLI  
 seq1            NSINISRLLAQICYYFEAVAQLPQETRNQ-LVVSVPSGNFGDLTAGLLAKSLGLPVKRFI  
 seq2            NSINVARFLPQAFYYFYAYAQLKKAGRAENVVICVPSGNFGNITAGLFGKKMGLPVRRFI  
                 **** :*.:.*  *** *  ::      .  . .**:****:: ** ..  :***: :::  
   
 seq3            VATNRNDILHRFMSGNRYDKDTLHPSLSPSMDIMVSSNFERLLFDLHGRNGKAVAELLDA  
 seq4            IATNRNDVLHRLLSTGDYARQTLEHTLSPSMDISVSSNFERLMFDLYERDGAAIASLMAA  
 seq5            LATNENNILSRFINGGDYSLGDVVATVSPSMDIQLASNFERYVYYLFGENPARVREAFAA  
 seq1            AATNVNDTVPRFLHDGQWSPKATQATLSNAMDVSQPNNWPR-VEELFR------------  
 seq2            AANNKNDIFYQYLQTGQYNPRPSVATIANAMDVGDPSNFAR-VLDLYGGS----------  
                  *.* *: . : :  . :       ::: :**:  ..*: * :  *.  
   
 seq3            FKASGKLSVEDQRWTEARKLFDSLAVSDEQTCETIAEVYRSCGELLDPHTAIGVRAAREC  
 seq4            FDD-GDITLSDAAMEKARQLFASHRVDDAQTLACIADVWGRTEYLLDPHSAIGYAAATQP  
 seq5            LPTKGRIDFTEAEMEKVRDEFLSRSVNEDETIATIAAFHRETGYILDPHTAVGVKAALEL  
 seq1            -------------RKIWQLKELGYAAVDDETTQQTMRELKELGYTSEPHAAVAYRALRDQ  
 seq2            -------------HAAIAAEISGTTYTDEQIRESVKACWQQTGYLLDPHGACGYRALEEG  
                                       .    : :                :** * .  *  :  
   
 seq3            RRSLSVPMVTLGTAHPVKFPEAVEKAGIGQAPALPAHLADLFEREERCTVLPNELAKVQA  
 seq4            GANTQTPWVTLATAHPAKFPDAIKASAVGTTAQLPVHLADLFERSEHFDVLPNDIAAVQR  
 seq5            VQDG-TPAVCLATAHPAKFAEAVVR-AVGFEPSRPTSLEGIEALPSRCDVLDADRDAIKA  
 seq1            LNPG-EYGLFLGTAHPAKFKESVEA-ILGETLDLPKELAERADLPLLSHNLPADFAALRK  
 seq2            LQPG-ETGVFLETAHPAKFLQTVES-IIGTEVEIPAKLRAFMKGEKKSLPMTKEFADFKS  
                         : * ****.** :::    :*     *  *            :  :   .:  
   
 seq3            FVSQHGNRGKPL  
 seq4            FMSGHLGA----  
 seq5            FIEKKAL-----  
 seq1            LMMNHQ------  
 seq2            YLLGK-------  
                  :  :  
 </pre>  
   
 <h3> Given a multiple sequence alignment, determine the most likely evolutionary history of the sequences (i.e., construct a phylogenetic tree).</h3>  
   
 From the extant five sequences that are similar and displayed in the previous alignment, we can construct  
421  a tree that depicts the "phylogenetic history" of the sequences.  a tree that depicts the "phylogenetic history" of the sequences.
422  Here is one reasonable tree for the last 5 sequences.  Here is one reasonable tree for the last 5 sequences.
   
423  <pre>  <pre>
424                            ,--------------------------------------------------- seq1                            ,--------------------------------------------------- seq1
425                            |                            |
# Line 381  Line 431 
431         |         |
432         |         |
433         |         |
434    ,----|    |
435      |
436      |             ,-------------------------------- seq3
437      |             |
438    |    |    |    |
439    |    |             ,-------------------------------- seq3    |-------------|
   |    |             |  
   |    |             |  
   |    |-------------|  
440    |                  |    |                  |
441    |                  |    |                  |
442    |                  `------------------------------ seq4    |                  `------------------------------ seq4
# Line 394  Line 444 
444    |    |
445    `---------------------------------------------- seq5    `---------------------------------------------- seq5
446  </pre>  </pre>
   
447  The tree suggests that at some point an ancestral  The tree suggests that at some point an ancestral
448  cell replicated.  One copy led (through a chain of descendants) to <b>seq5</b>, while the remaining sequences descend  cell replicated. One copy led (through a chain of descendants) to <b>seq5</b>,
449  from the other copy.  while the remaining sequences descend from the other copy.
450  <p>  <p>Note that we now have alignments that
451  Note that we now have alignments that  contain thousands of sequences, and even displaying such trees is
452  contain thousands of sequences, and even displaying such trees is nontrivial.  nontrivial.
453  Because evolution plays such a central role in the phenomena we study, the construction of alignments  Because evolution plays such a central role in the phenomena we study,
454  and trees in order to compare extant versions of proteins and gain insight into their historical origins  the construction of alignments
455    and trees in order to compare extant versions of proteins and gain
456    insight into their historical origins
457  is considered basic to the task at hand.  is considered basic to the task at hand.
458    </p>
459  <h3>Some Random Facts that You Should Absorb</h2>  <h3>Some Random Facts that You Should Absorb</h3>
460    Most genomes of bacteria contain between 400,000 and 12,000,000
461  Most genomes of bacteria contain between 400,000 and 12,000,000 characters.  characters. Normally, the genes in a genome
 Normally, the genes in a genome  
462  cover about 90% of the genome.  cover about 90% of the genome.
463  Normally, there is about one gene per 1000 characters in a bacterial genome.  Normally, there is about one gene per 1000 characters in a bacterial
464  <p>  genome.
465  So,  <p>So, </p>
466  <ul>  <ul>
467  <li> What is the length of the average protein sequence?  <li> What is the length of the average protein sequence? </li>
468  <li>How many genes do these  <li>How many genes do these
469  genomes have?  genomes have? </li>
470  <li>What is the average length of a gene?  <li>What is the average length of a gene?
471    </li>
472  </ul>  </ul>
473  <br>  <br>
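One rough way to work through the three questions is plain arithmetic on the figures quoted above. The sketch assumes the stated averages hold exactly (90% coverage, one gene per 1000 characters) and that three DNA characters encode one protein character; the function name `estimate` and its parameters are hypothetical.

```python
def estimate(genome_length, coverage_percent=90, chars_per_gene=1000):
    """Back-of-the-envelope estimates from the averages quoted in the text."""
    gene_count = genome_length // chars_per_gene
    # Genes cover about 90% of the genome, so each 1000-character stretch
    # holds roughly 900 characters of gene.
    average_gene_length = chars_per_gene * coverage_percent // 100
    # Three DNA characters (one codon) encode one amino acid.
    average_protein_length = average_gene_length // 3
    return gene_count, average_gene_length, average_protein_length


for genome_length in (400_000, 12_000_000):
    print(genome_length, estimate(genome_length))
```

Under these assumptions, genomes have roughly 400 to 12,000 genes, an average gene is about 900 characters long, and an average protein is about 300 characters long.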
474  It is worth spending just a short bit of time thinking about what types of  It is worth spending just a short bit of time thinking about what types
475    of
476  machines must exist in each cell.  Here are a few thoughts to start with  machines must exist in each cell.  Here are a few thoughts to start with
477  <ul>  <ul>
478  <li>  <li>
479  There must be one or more machines that support replication of the cell.  You would  There must be one or more machines that support replication of the
480  need something to copy the genome, and you would need something that could build the DNA  cell. You would
481  bases that represent the characters (i.e., you will need machines to build the molecules  need something to copy the genome, and you would need something that
482  corresponding to each of the four characters in the alphabet of DNA bases).  could build the DNA
483  <li>  bases that represent the characters (i.e., you will need machines to
484  As we mentioned, you have transport machines that take things into and out of the cell.  Many  build the molecules
485  cells can import food in the form of sugar molecules.  For example, many cells can import  corresponding to each of the four characters in the alphabet of DNA
486  <i>glucose</i>, a six-carbon compound.  As the compound gets broken down into smaller compounds,  bases).
487  energy is salvaged from the broken bonds to power the machines in the cell.  The smaller compounds  </li>
488    <li>As we mentioned, you have transport machines that take
489    things into and out of the cell. Many
490    cells can import food in the form of sugar molecules. For example, many
491    cells can import
492    <i>glucose</i>, a six-carbon compound. As the compound
493    gets broken down into smaller compounds,
494    energy is salvaged from the broken bonds to power the machines in the
495    cell. The smaller compounds
496  are used as building blocks for other needs.  are used as building blocks for other needs.
497  <li>  </li>
498  There must be one or more machines involved in building proteins from the descriptions in the genes.  <li>There must be one or more machines involved in building
499  In particular, we will need a machine for each of the amino acids (unless the cell can import some  proteins from the descriptions in the genes.
500    In particular, we will need a machine for each of the amino acids
501    (unless the cell can import some
502  of them).  of them).
503  <li>  </li>
504  There must be mechanisms for sensing what is going on in the environment and allowing the cell  <li>There must be mechanisms for sensing what is going on in
505    the environment and allowing the cell
506  to react to it.  For example, many cells can "swim" towards food.  to react to it.  For example, many cells can "swim" towards food.
507    </li>
508  </ul>  </ul>
509  Those were just a few examples.  For any cell, we have many, many machines, and we still  Those were just a few examples. For any cell, we have many, many
510  do not even understand what some of them do.  Later, we will try to offer a more structured  machines, and we still
511    do not even understand what some of them do. Later, we will try to
512    offer a more structured
513  estimate of what is already known.  estimate of what is already known.
514  <p>  <p>About 50-60% of the genes occur within 5000 characters of
515  About 50-60% of the genes occur within 5000 characters of another gene such that  another gene such that
516  the two genes encode proteins that are part of the same cellular machine.  This fact  the two genes encode proteins that are part of the same cellular
517  suggests that just having a large number of genomes would enable a person to group  machine. This fact suggests that just having a large number of genomes
518  the genes into the machines they implement, without the person understanding the functions  would enable a person to group
519    the genes into the machines they implement, without the person
520    understanding the functions
521  of the machines or the roles played by each protein.  of the machines or the roles played by each protein.
522  <p>  </p>
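The proximity clue can be sketched as a grouping step over gene positions. This is a minimal illustration under stated assumptions: genes are given as hypothetical `(name, start, end)` records on one genome, and two genes are treated as neighbors when the gap between them is at most 5000 characters. A cluster found in one genome is only a candidate; the text's point is that such neighborhoods become informative when they recur across many genomes.

```python
def proximity_clusters(genes, max_gap=5000):
    """Group genes into runs in which each gene lies within max_gap of the run so far."""
    clusters = []
    for name, start, end in sorted(genes, key=lambda gene: gene[1]):
        if clusters and start - clusters[-1]["end"] <= max_gap:
            clusters[-1]["names"].append(name)
            clusters[-1]["end"] = max(clusters[-1]["end"], end)
        else:
            clusters.append({"names": [name], "end": end})
    return [cluster["names"] for cluster in clusters]


# Hypothetical gene coordinates, for illustration only.
genes = [
    ("geneA", 100, 1000),
    ("geneB", 1200, 2100),
    ("geneC", 20000, 21000),
    ("geneD", 21500, 22400),
]
print(proximity_clusters(genes))
```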
523  Occasionally, proteins that are usually distinct in most cells are fused into a single protein in  <p>Occasionally, proteins that are usually distinct in most cells
524  a few cells.  In these cases, the fused gene is (by definition) part of a single machine, and  are fused into a single protein in
525  in most cells in which the proteins are not fused, the two distinct proteins are separate components  a few cells. In these cases, the fused gene is (by definition) part of
526  of a single machine.  This, too, offers clues to support analysis of which proteins go with which machines.  a single machine, and
527  <p>  in most cells in which the proteins are not fused, the two distinct
528  Biologists have figured out the roles of about 50% of the genes.  That is, they can  proteins are separate components
529  place the gene in a cellular machine, they know what the machine does, and they know  of a single machine. This, too, offers clues to support analysis of
530  the specific role of the gene in sustaining the functionality of the machine.  which proteins go with which machines.
531  <br><br>  </p>
532    <p>Biologists have figured out the roles of about 50% of the
533  <h23Imposing a Structure on Characterizing the Inventory</h2>  genes. That is, they can
534    place the gene in a cellular machine, they know what the machine does,
535  One central goal of bioinformatics is to support an accurate characterization of the cellular  and they know
536  machinery for each cell.  It is of major importance to biologists that we be able to support  the specific role of the gene in sustaining the functionality of the
537  comparative analysis of cells.  Perhaps, the most important aspect of understanding cells relates to  machine.
538  their origin in an evolutionary process.  Cells have a long evolutionary history dating back billions of  <br>
539  years.  The machines we see in cells today arose in the past, so we expect to see many current cells  <br>
540  using machinery that resembles what turns up in other cells.  When we compare machines from different  <h23imposing a="" structure="" on="" characterizing="" the="" inventory="">
541  cells they often look remarkably similar.  On the other hand, those that had a common origin in a cell that existed  One central goal of bioinformatics is to support an accurate
542  billions of years in the past may now have versions that are not very similar.  Modifications, optimizations,  characterization of the cellular
543  and insignificant alterations all combine to explore the space of operational possibilities for  machinery for each cell. It is of major importance to biologists that we
544  each type of machine.  Hence, we need a framework for studying similarities and differences in the  be able to support
545    comparative analysis of cells. Perhaps, the most important aspect of
546    understanding cells relates to
547    their origin in an evolutionary process. Cells have a long evolutionary
548    history dating back billions of
549    years. The machines we see in cells today arose in the past, so we
550    expect to see many current cells
551    using machinery that resembles what turns up in other cells. When we
552    compare machines from different
553    cells they often look remarkably similar. On the other hand, those that
554    had a common origin in a cell that existed billions of years in the
555    past may now have versions that are not very similar. Modifications,
556    optimizations,
557    and insignificant alterations all combine to explore the space of
558    operational possibilities for
559    each type of machine. Hence, we need a framework for studying
560    similarities and differences in the
561  cellular machines and the proteins that implement them.  cellular machines and the proteins that implement them.
562  <p>  </h23imposing></p>
563    <p>Here is a short formulation of one way to do this:
564  Here is a short formulation of one way to do this:  <br>
565  <br><br>  <br>
566    </p>
567  <ul>  <ul>
568  <li>A <b>subsystem</b> (i.e., an abstract cellular machine) is a set of functional roles.  <li>A <b>subsystem</b> (i.e., an abstract cellular
569  <li>Each protein implements one or more functional roles.  The set of functional roles  machine) is a set of functional roles.
570  implemented by the protein is called the <b>function of the protein</b>.  The function of a  multifunctional  </li>
571  protein that implements {functional-role-1,functional-role-2} is normally written as  <li>Each protein implements one or more functional roles. The
572    set of functional roles
573    implemented by the protein is called the <b>function of the
574    protein</b>. The function of a multifunctional
575    protein that implements {functional-role-1,functional-role-2} is
576    normally written as
577  <i>functional-role-1 / functional-role-2</i>.  <i>functional-role-1 / functional-role-2</i>.
578  <br><br>  <br>
579  <li>A <b>populated subsystem</b> is a subsystem with an attached spreadsheet.  Each column  <br>
580  in the spreadsheet corresponds to a functional role in the subsystem, and each row corresponds to  </li>
581  a specific genome.  Each cell in the spreadsheet contains the genes from the corresponding genome  <li>A <b>populated subsystem</b> is a subsystem
582  that implement the designated functional role (there may be 0 or more such genes).  with an attached spreadsheet. Each column
583  </ul>  in the spreadsheet corresponds to a functional role in the subsystem,
584  <br><br>  and each row corresponds to
585  We do not actually know what machines are present in a cell.  We are in the midst of a grand  a specific genome. Each cell in the spreadsheet contains the genes from
586  effort to clarify which are there and what they do.  The formulation of subsystems as abstract machines  the corresponding genome
587  in which each row of the subsystem describes a specific cellular machine that is believed to be present,  that implement the designated functional role (there may be 0 or more
588    such genes).
589    </li>
590    </ul>
591    <br>
592    <br>
593    We do not actually know what machines are present in a cell. We are in
594    the midst of a grand
595    effort to clarify which are there and what they do. The formulation of
596    subsystems as abstract machines
597    in which each row of the subsystem describes a specific cellular
598    machine that is believed to be present,
599  represents a way to maintain a collection of estimates or assertions.  represents a way to maintain a collection of estimates or assertions.
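The definitions of subsystem and populated subsystem map naturally onto a small data structure. The sketch below is one possible in-memory form, not the representation used by any actual system; the class `PopulatedSubsystem`, the helper `function_of`, and all genome, role, and gene names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PopulatedSubsystem:
    """A subsystem (set of functional roles) with its attached spreadsheet."""
    name: str
    roles: list                                 # columns: functional roles
    rows: dict = field(default_factory=dict)    # genome -> {role: [genes]}

    def add_gene(self, genome, role, gene):
        if role not in self.roles:
            raise ValueError(f"{role!r} is not a functional role of {self.name!r}")
        row = self.rows.setdefault(genome, {r: [] for r in self.roles})
        row[role].append(gene)

    def cell(self, genome, role):
        """Genes from `genome` that implement `role` (there may be 0 or more)."""
        return self.rows.get(genome, {}).get(role, [])

    def missing_roles(self, genome):
        """Roles with no assigned gene in this genome's row."""
        return [r for r in self.roles if not self.cell(genome, r)]


def function_of(roles):
    """Written form of a protein's function, e.g. 'role-1 / role-2'."""
    return " / ".join(roles)


subsystem = PopulatedSubsystem("example-subsystem",
                               ["functional-role-1", "functional-role-2"])
subsystem.add_gene("genome-A", "functional-role-1", "gene-17")
subsystem.add_gene("genome-A", "functional-role-1", "gene-42")
print(subsystem.cell("genome-A", "functional-role-1"))
print(subsystem.missing_roles("genome-A"))
print(function_of(["functional-role-1", "functional-role-2"]))
```

Empty cells are kept explicit so that a row can record the assertion that a role has no known gene in a genome.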
600  <p>  <p>A <b>protein family</b> is defined to be a set of
601  A <b>protein family</b> is defined to be a set of proteins that implement the same functional roles and  proteins that implement the same functional roles and
602  are similar over the entire lengths of the proteins.  are similar over the entire lengths of the proteins.
603  <p>  </p>
604  We seek a situation in which each protein occurs in one or more subsystems and in a single protein family.  <p>We seek a situation in which each protein occurs in one or
605    more subsystems and in a single protein family.
606  The computational tasks imposed by such a goal are obvious:  The computational tasks imposed by such a goal are obvious:
607    </p>
608  <ul>  <ul>
609  <li>We need to construct databases that implement at least the following entities:  <li>We need to construct databases that implement at least the
610    following entities:
611  <ol>  <ol>
612  <li>cells (i.e., each cell must have an ID and a set of attributes),  <li>cells (i.e., each cell must have an ID and a set of
613    attributes),
614    </li>
615  <li>genomes,  <li>genomes,
616    </li>
617  <li>genes,  <li>genes,
618    </li>
619  <li>proteins,  <li>proteins,
620    </li>
621  <li>functional roles,  <li>functional roles,
622    </li>
623  <li>subsystems, and  <li>subsystems, and
624    </li>
625  <li>protein families.  <li>protein families.
626    </li>
627  </ol>  </ol>
628  <li> We need to add support for developing clues to function by integrating data  </li>
629    <li> We need to add support for developing clues to function by
630    integrating data
631  from sources like proximity within the genome, fusions, etc.  from sources like proximity within the genome, fusions, etc.
632  <li>We need to support a framework for the development of populated subsystems.  </li>
633  <li>We need to construct decision procedures for membership in protein families.  Some  <li>We need to support a framework for the development of
634  of these procedures will be quite complex, although the majority of cases can be  populated subsystems.
635    </li>
636    <li>We need to construct decision procedures for membership in
637    protein families. Some of these procedures will be quite complex,
638    although the majority of cases can be
639  handled by fairly general procedures.  handled by fairly general procedures.
640    </li>
641  </ul>  </ul>
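The stated goal (each protein in one or more subsystems and in a single protein family) can be expressed as a consistency check over such a database. This is a sketch under assumptions: the two mappings stand in for database queries, and the names `check_assignments`, `families_of`, and `subsystems_of` are hypothetical.

```python
def check_assignments(proteins, families_of, subsystems_of):
    """Report proteins that do not yet meet the goal stated in the text.

    families_of and subsystems_of map a protein ID to a set of names;
    a missing key is treated as an empty set.
    """
    problems = {}
    for protein in proteins:
        families = families_of.get(protein, set())
        subsystems = subsystems_of.get(protein, set())
        issues = []
        if len(families) != 1:
            issues.append(f"in {len(families)} protein families (expected 1)")
        if not subsystems:
            issues.append("in no subsystem")
        if issues:
            problems[protein] = issues
    return problems


# Hypothetical data, for illustration only.
proteins = ["p1", "p2", "p3"]
families_of = {"p1": {"family-1"}, "p2": {"family-1", "family-2"}}
subsystems_of = {"p1": {"subsystem-A"}, "p2": {"subsystem-A", "subsystem-B"}}
print(check_assignments(proteins, families_of, subsystems_of))
```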
642    <h3>States of the Cell</h3>
643  <h3>States of the Cell</h2>  The notion of <i>subsystem</i> was introduced as an <i>abstract
644    machine</i> -- that is, as an
645  The notion of <i>subsystem</i> was introduced as an <i>abstract machine</i> -- that is, as an  attempt to create a framework for understanding variations within
646  attempt to create a framework for understanding variations within specific cellular machines via  specific cellular machines via
647  a form of comparative analysis.  a form of comparative analysis. In any specific cell, sets of specific
648    cellular machines are switched on and off as units. That is, they are <i>co-regulated</i>.
649  In any specific cell, sets of specific cellular machines are  We will call such a set
650  switched on and off as units.  That is, they are <i>co-regulated</i>.  We will call such a set  of <i>co-regulated cellular machines</i> a <b>regulon</b>
651  of <i>co-regulated cellular machines</i> a <b>regulon</b> (note that a regulon is often a set containing  (note that a regulon is often a set containing
652  a single cellular machine).  A <b>state</b> of a cell will be defined  a single cellular machine). A <b>state</b> of a cell will
653  as the set of regulons that are operational at a point in time.  Thus, a state amounts to the set  be defined
654    as the set of regulons that are operational at a point in time. Thus, a
655    state amounts to the set
656  of cellular machines that are operational at one instant.  of cellular machines that are operational at one instant.
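These definitions translate directly into sets. In the sketch below, a regulon is modeled as a frozen set of machine names and a state as a set of regulons; the machine names and the function `operational_machines` are hypothetical and chosen only for illustration.

```python
def operational_machines(state):
    """Union of the machines in every regulon that is switched on in this state."""
    return set().union(*state)


# Hypothetical regulons; a regulon may contain a single machine.
motility = frozenset({"flagellum", "chemotaxis-sensing"})
glucose_use = frozenset({"glucose-transport"})
replication = frozenset({"genome-copying", "DNA-base-synthesis"})

state = {motility, glucose_use}          # regulons operational at one instant
print(sorted(operational_machines(state)))
```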
657  <p>  <p>If we think of a car as a bag of machines that interact to
658  If we think of a car as a bag of machines that interact to make it function, we might consider there  make it function, we might consider there
659  to be a huge number of states.  There are many very minor "machines" like the arm rest (or the radio, r the night light) that can be on or off.  However, we can divide the states of a car into major groupings based on the status  to be a huge number of states. There are many very minor "machines"
660  of some key "machines".  For example, "off" (the state in which the engine is turned off and the car is parked) and  like the arm rest (or the radio, or the night light) that can be on or
661  "on" (the engine is running and the car is moving) might be viewed as a crude partitioning of the states into  off. However, we can divide the states of a car into major groupings
662    based on the status
663    of some key "machines". For example, "off" (the state in which the
664    engine is turned off and the car is parked) and
665    "on" (the engine is running and the car is moving) might be viewed as a
666    crude partitioning of the states into
667  two "major states".  two "major states".
668  <p>  </p>
669  Similarly, I believe that we should think about <i>major states of the cell</i> as being determined by the  <p>Similarly, I believe that we should think about <i>major
670  functioning (or not) of a limited set of regulons.  The determination of these regulons, the major states,  states of the cell</i> as being determined by the functioning (or
671  and how transitions between them are managed are all now parts of the picture being filled in.  not) of a limited set of regulons. The determination of these regulons,
672    the major states,
673    and how transitions between them are managed are all now parts of the
674  <h3>Microarrays</h2>  picture being filled in.
675    </p>
676    <h3>Microarrays</h3>
677    Microarrays are, for a given genome, two lists of genes that "changed
678    expression levels" between two states of a
679    cell. Basically, the first list contains genes that were "active" during
680    the first state, but not the second; and the
681    second list contains genes that were "active" in the second but not the
682    first. If a cellular
683    machine utilizes protein <i>X</i>, and <i>X</i>
684    is in the first list, and if <i>X</i> is used in
685    only one cellular machine, then it would be reasonable to infer that
686    the machine was
687    active in the first state, but not the second. If one knew the regulons
688    for a specific cell, it would go
689    a long way to support extraction of insights from these microarrays. On
690    the other hand, if one had many,
691    many microarrays, and if the specific cellular machines for the cell
692    are known, then one could make
693    substantial progress in uncovering the exact composition of the
694    regulons that make up the cell.<br>
695    <br>
696    We are just now reaching the point where we do, in fact, have hundreds
697    of microarrays (each representing changes between two sampled states of
698    the cell). &nbsp;<br>
699    Let us reflect on how one might use this data to uncover the regulons
700    that are represented and how they relate to the major "states of the
701    cell".<br>
702    <br>
703    We might begin by trying to determine sets of genes from each subsystem
704    that appear to "move together". &nbsp; Actually, we want to arrive
705    at a set of genes that perform a well-defined function, some subset of
706    which almost always shows up in the microarrays as "moving together".
707    &nbsp;Of these, if we have genes that occur only in a single
708    subsystem, then it would be reasonable to think of these as <span style="font-style: italic;">signatures</span> for the set
709    of genes. &nbsp;The most natural way to do this would be to start
710    with metabolic subsystems, or even better <span style="font-style: italic;">scenarios</span> (discussed
711    below) which are subsets of functional roles from a metabolic subsystem
712    such that the subset is a connected set with well-defined inputs and
713    outputs. &nbsp;We wish then to define discovery of the regulon sets
714    associated with each condition as follows:<br>
715    <br>
716    <ol>
717    <li>&nbsp;First, for each scenario define&nbsp;</li>
718    <ul>
719    <li>the set of genes that are expected to show up in a
720    microarray when the scenario is activated or deactivated (call this
721    "the set of genes that move together" = <span style="font-style: italic;">SGMT</span> for the scenario),</li>
722    <br>
723    <li>the subset of genes (perhaps empty) of the SGMT that are <span style="font-style: italic;">signatures</span> (call
724    this <span style="font-style: italic;">signatures of the
725    scenario</span>).</li>
726    </ul>
727    <br>
728    <li>Then define the <span style="font-style: italic;">set
729    of regulons</span>. &nbsp;Each regulon is &nbsp;a set of
730    scenarios. &nbsp;There is a cost <span style="font-weight: bold;">cost_reg</span> associated
731    with the definition of each regulon. &nbsp;This prevents the
732    definition of numerous regulons all containing just one scenario.
733    &nbsp;If the penalty is set too high, only one regulon will be
734    defined. &nbsp;If it is set too low, then a large set of small
735    regulons results.</li>
736    <br>
737    <li>Next, you need to define the set of regulons that were
738    activated for each microarray and the set that were deactivated.</li>
739    <br>
740    <li>Now, you compute a score for your decisions as&nbsp;<span style="font-weight: bold;">score = P - M - (cost_reg *
741    number_of_defined_regulons * number_of_microarrays)</span> where</li>
742    <br>
743    <ul>
744    <li><span style="font-weight: bold;">P</span>
745    = <span style="font-weight: bold;">p1 + p2,</span>
746    where&nbsp;<span style="font-weight: bold;"></span></li>
747    <br>
748    <ul>
749    <li><span style="font-weight: bold;">p1</span>
750    = <span style="font-weight: bold;">a1 * value_signature </span>and
751    <span style="font-weight: bold;">a1</span>
752    is the number of signatures of scenarios that moved as predicted, and <span style="font-weight: bold;">value_signature </span>is
753    the value associated with a signature moving in the direction predicted,</li>
754    <br>
755    <li><span style="font-weight: bold;">p2 = a2 *
756    value_SGMT_nonsig</span> and <span style="font-weight: bold;">a2</span>
757    is the number of non-signature SGMT genes that moved as predicted, and <span style="font-weight: bold;">value_SGMT_nonsig</span> is
758    the value associated with a non-signature SGMT gene moving in the
759    direction predicted, and</li>
760    <br>
761    </ul>
762    <li><span style="font-weight: bold;">M = m1 +
763    m2, where</span></li>
764    <br>
765    <ul>
766    <li><span style="font-weight: bold;">m1 = b1 *
767    value_signature</span> and <span style="font-weight: bold;">b1</span>
768    is the number of signatures of scenarios that did not move as
769    predicted, &nbsp;and</li>
770    <br>
771    <li><span style="font-weight: bold;">m2 = b2 *
772    value_SGMT_nonsig </span>and <span style="font-weight: bold;">b2</span>
773    is the number of non-signature SGMT genes that did not move as predicted.&nbsp;</li>
774    </ul>
775    <br>
776    The&nbsp;<span style="font-weight: bold;">score </span>reflects
777    how well your decisions in the first three steps match the data in the
778    microarrays. &nbsp;The object is to make the sets of decisions in
779    the first three steps in a way that maximizes the&nbsp;<span style="font-weight: bold;">score</span>.<br>
780    <br>
783    </ul>
784    </ol>
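The scoring rule in step 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the SEED project: the names `Weights`, `count_moves` and `score` are hypothetical, a gene is taken to have "moved as predicted" when it appears in the observed list matching its predicted direction, and a2 and b2 are taken to count only the non-signature SGMT genes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    """Tunable values from the text; the numbers chosen are up to the analyst."""
    value_signature: float    # reward/penalty per signature gene
    value_sgmt_nonsig: float  # reward/penalty per non-signature SGMT gene
    cost_reg: float           # cost charged per regulon per microarray


def count_moves(predicted_up, predicted_down, observed_up, observed_down, signatures):
    """Return (a1, a2, b1, b2) for one microarray.

    Assumption: a gene "moved as predicted" if it is in the observed list
    that matches its predicted direction.
    """
    a1 = a2 = b1 = b2 = 0
    for predicted, observed in ((predicted_up, observed_up),
                                (predicted_down, observed_down)):
        for gene in predicted:
            moved_as_predicted = gene in observed
            if gene in signatures:
                if moved_as_predicted:
                    a1 += 1
                else:
                    b1 += 1
            elif moved_as_predicted:
                a2 += 1
            else:
                b2 += 1
    return a1, a2, b1, b2


def score(a1, a2, b1, b2, n_regulons, n_microarrays, w):
    """score = P - M - (cost_reg * number_of_defined_regulons * number_of_microarrays)."""
    p = a1 * w.value_signature + a2 * w.value_sgmt_nonsig
    m = b1 * w.value_signature + b2 * w.value_sgmt_nonsig
    return p - m - w.cost_reg * n_regulons * n_microarrays
```

For many microarrays, one would sum the four counts over all arrays before calling `score`. Searching over scenario, regulon and activation choices to maximize the score is the hard part and is not attempted here.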
785    
 Microarrays are, for a given genome, two lists of genes that "changed expression levels" between two states of a  
 cell.  Basically, the first list contains genes that were "active" during the first state, but not the second; and the  
 second list contains genes that were "active" in the second but not the first.  If a cellular  
 machine utilizes protein <i>X</i>, and <i>X</i> is in the first list, and if <i>X</i> is used in  
 only one cellular machine, then it would be reasonable to infer that the machine was  
 active in the first state, but not the second.  If one knew the regulons for a specific cell, it would go  
 a long way to support extraction of insights from these microarrays.  On the other hand, if one had many,  
 many microarrays, and if the specific cellular machines for the cell are known, then one could make  
 substantial progress in uncovering the exact composition of the regulons that make up the cell.  
786    
787  <h2>Notes for the Enhanced Abstraction</h2>  <h2>Notes for the Enhanced Abstraction</h2>
788    The process of <b>expressing a gene</b> amounts to using
789  The process of <b>expressing a gene</b> amounts to using the gene to produce the functional component of  the gene to produce the functional component of
790  a machine (a protein for a protein-encoding gene, and an RNA for an RNA-encoding gene).  a machine (a protein for a protein-encoding gene, and an RNA for an
791  The process of expressing a protein-encoding gene takes a gene (a string of DNA formed by concatenating a sequence of  RNA-encoding gene).
792  regions from contigs) and producing a protein is normally thought of as taking place in two steps.  The process of expressing a protein-encoding gene takes a gene (a
793  <b>Transcription</b> is the process of a specific machine moving along the contig and making a copy of the  string of DNA formed by concatenating a sequence of
794  gene as RNA.  This string of RNA is then <b>translated</b> by a separate machine.  The machine that performs  regions from contigs) and producing a protein is normally thought of as
795  the copying of the gene into a string of RNA is called an <b>RNA polymerase</b>.  The machine to translate  taking place in two steps.
796  the RNA into a protein, the <b>ribosome</b>, is made up of both proteins and RNA components.  <b>Transcription</b> is the process of a specific machine
797  <p>  moving along the contig and making a copy of the
798  Machines can be made up of both protein and RNA components, although most machines are built from  gene as RNA. This string of RNA is then <b>translated</b>
799  just proteins. Some of the most fundamental questions in biology relate to how life started and the steps  by a separate machine. The machine that performs
800  required to gradually enrich the basic machinery to the point where this magnificent information storage and  the copying of the gene into a string of RNA is called an <b>RNA
801  maintenance system based on DNA, RNA and proteins could have arisen.  There is much that can be inferred by  polymerase</b>. The machine to translate
802  reasoning back from what we now observe and reasoning forward from the relatively little we know of  the RNA into a protein, the <b>ribosome</b>, is made up of
803  what the early earth was like. One possible set of goals would be to first understand in detail the inventory  both proteins and RNA components.
804  of components we now see in life forms, composing something analogous to a CAD/CAM system describing life forms.  <p>Machines can be made up of both protein and RNA components,
805  Then, as a second step, to understand the sequence of transformations that led from some initial raw components  although most machines are built from
806    just proteins. Some of the most fundamental questions in biology relate
807    to how life started and the steps
808    required to gradually enrich the basic machinery to the point where
809    this magnificent information storage and
810    maintenance system based on DNA, RNA and proteins could have arisen.
811    There is much that can be inferred by
812    reasoning back from what we now observe and reasoning forward from the
813    relatively little we know of what the early earth was like. One
814    possible set of goals would be to first understand in detail the
815    inventory
816    of components we now see in life forms, composing something analogous
817    to a CAD/CAM system describing life forms.
818    Then, as a second step, to understand the sequence of transformations
819    that led from some initial raw components
820  to initial life forms to those we have seen and characterized.  to initial life forms to those we have seen and characterized.
821  <p>  </p>
822  The need to allow occasional "nonstandard" characters in protein sequences and a loosening of the correspondence  <p>The need to allow occasional "nonstandard" characters in
823  between a gene and characters in the protein sequence it can be used to build results from the fact that  protein sequences and a loosening of the correspondence
824  evolution has produced the existing genetic codes and they continue to evolve (either converging or diverging  between a gene and characters in the protein sequence it can be used to
825  depending on the outcome of basically random processes operating under selective pressure).  build results from the fact that
826    evolution has produced the existing genetic codes and they continue to
827    evolve (either converging or diverging
828    depending on the outcome of basically random processes operating under
829    selective pressure).
830  <br>  <br>
831    </p>
832  <h2>Notes on the Abstraction Extended to Support Regulation</h2>  <h2>Notes on the Abstraction Extended to Support Regulation</h2>
833    There are two basically different regulatory mechanisms in the cell. In
834  There are two basically different regulatory mechanisms in the cell.  In one, you have a metabolic  one, you have a metabolic
835  network in which fluxes are tightly controlled by positive and negative feedback loops. This <b>metabolic  network in which fluxes are tightly controlled by positive and negative
836  regulation</b> occurs very rapidly.  <b>Transcriptional regulation</b> occurs orders of magnitude more  feedback loops. This <b>metabolic
837  slowly.  It is just this transcriptional regulation that we consider in this extension.  regulation</b> occurs very rapidly. <b>Transcriptional
838  <p>  regulation</b> occurs orders of magnitude more slowly. It is just
839    this transcriptional regulation that we consider in this extension.
840  As the cell changes state, regulons are activated or de-activated by  <p>As the cell changes state, regulons are activated or
841    de-activated by
842  transcriptional regulators (either protein or RNA) binding to specific  transcriptional regulators (either protein or RNA) binding to specific
843  sites in the DNA.  This model has the redeeming characteristic of  sites in the DNA.  This model has the redeeming characteristic of
844  simplicity.  It is certainly the case that there are innumerable  simplicity.  It is certainly the case that there are innumerable
# Line 606  Line 852 
852  control sites within the genome is a major form of regulation and  control sites within the genome is a major form of regulation and
853  probably the right place to start any effort to formulate a useful  probably the right place to start any effort to formulate a useful
854  abstraction.  abstraction.
855    </p>
856  <h1>The Role of Bioinformatics in Supporting the Genomic Revolution</h1>  <h1>The Role of Bioinformatics in Supporting the Genomic
857    Revolution</h1>
858  Within the growing genomics revolution, one can easily divide developments and  Within the growing genomics revolution, one can easily divide
859  goals into those relating to advances in medicine and agriculture from those relating to  developments and
860  pure science.  Here we consider only issues relating to pushing advances in basic research.  goals into those relating to advances in medicine and agriculture from
861    those relating to
862    pure science. Here we consider only issues relating to pushing advances
863    in basic research.
864  Here is an overview of our perspective:  Here is an overview of our perspective:
865  <ol>  <ol>
866  <li> The different life forms that now exist were produced by an evolutionary process,  <li> The different life forms that now exist were produced by
867  which leads to our view that comparative analysis is the key to understanding.  Biological  an evolutionary process,
868  machines that exist in complex forms will often also still exist in simpler forms (usually  which leads to our view that comparative analysis is the key to
869    understanding. Biological
870    machines that exist in complex forms will often also still exist in
871    simpler forms (usually
872  in simpler organisms).  in simpler organisms).
873  <li> Unravelling exactly how a machine works is more easily done in simpler organisms.  They  </li>
874  are easier to work with, and it is easier to gather the data needed to support comparative analysis.  <li> Unravelling exactly how a machine works is more easily
875    done in simpler organisms. They
876  <li> This leads to the view that we should try to understand single-celled organisms to lay  are easier to work with, and it is easier to gather the data needed to
877    support comparative analysis.
878    </li>
879    <li> This leads to the view that we should try to understand
880    single-celled organisms to lay
881  the foundation for analysis of multicellular organisms.  the foundation for analysis of multicellular organisms.
882    </li>
883  <li> The characterization of unicellular life will require access to orders of magnitude  <li> The characterization of unicellular life will require
884  more data than exist now (we have more-or-less complete genomes for about 1000 genomes, but  access to orders of magnitude
885  that represents a small fraction of a percent of extant single-celled life forms).  more data than exist now (we have more-or-less complete genomes for
886    about 1000 genomes, but
887  <li> The immediate basic steps that are taking place are roughly:  that represents a small fraction of a percent of extant single-celled
888    life forms).
889    </li>
890    <li> The immediate basic steps that are taking place are
891    roughly:
892    <br>
893    <br>
894  <ol>  <ol>
895  <li> Attempt to formulate a growing list of abstract machines that correspond  <li> Attempt to formulate a growing list of abstract
896  to the many specific machines that implement the same goal.  These abstract machines (subsystems)  machines that correspond
897    to the many specific machines that implement the same goal. These
898    abstract machines (subsystems)
899  represent the basic units that make up life forms.  represent the basic units that make up life forms.
900    </li>
901  <li> Create protein and RNA families in which the members are all homologous (share a common ancestor),  <li> Create protein and RNA families in which the members
902  remain similar over almost all of the sequence, and all implement a common function.  are all homologous (share a common ancestor),
903    remain similar over almost all of the sequence, and all implement a
904  <li> Build alignments for each protein family, along with phylogenetic trees that represent  common function.
905    </li>
906    <li> Build alignments for each protein family, along with
907    phylogenetic trees that represent
908  an estimate of the history of how these specific sequences evolved.  an estimate of the history of how these specific sequences evolved.
909    </li>
910  <li>Provide a computational framework to support continued maintenance and development of these  <li>Provide a computational framework to support continued
911  basic data types.  maintenance and development of these
912    basic data types.</li>
913  </ol>  </ol>
914    <br>
915  <li> A limited number of groups have progressed to the point where they can create models of  Groups are now actively pursuing all of these goals. &nbsp;For
916  an organism that display predictive capabilities.  There are many forms of modeling.  In our view  individuals wishing to build a research program, we suggest
917  it is important that we reach the state where we can routinely model states of the cell, transitions  collaborating with an existing group or moving to one of the newer
918  between states, and metabolic characteristics of the cell.  We believe that it is now possible  areas that are now emerging.
919  to create fairly comprehensive representations of the metabolic networks of some bacteria.  </li>
920  In these cases, we have substantial amounts of physiological data, the number of abstract machines  <br>
921  in the cell is fairly limited, and it is possible to compare the predictions against observed results.  <li> A limited number of groups have progressed to the point
922    where they can create models of an organism that display predictive
923    capabilities. There are many forms of modeling. In our view
924    it is important that we reach the state where we can routinely model
925    states of the cell, transitions
926    between states, and metabolic characteristics of the cell. We believe
927    that it is now possible
928    to create fairly comprehensive representations of the metabolic
929    networks of some bacteria. In these cases, we have substantial amounts
930    of physiological data, the number of abstract machines
931    in the cell is fairly limited, and it is possible to compare the
932    predictions against observed results. &nbsp; An effort has begun by
933    a
934    team within the SEED project, led by researchers from Hope College, to
935    develop a library of what they call&nbsp;<span style="font-style: italic;">scenarios</span>.
936    &nbsp;These scenarios capture the idea of a specific machine
937    implementing a metabolic transformation operating with well-defined
938    inputs and outputs. From a large and growing number of scenarios in
939    this library, they automatically reconstruct metabolic networks for
940    most of the bacteria for which genomes have been sequenced.
941    &nbsp;This
942    effort is setting the stage for widespread whole genome metabolic
943    modeling.&nbsp;</li>
944    <li>Rapid progress has been made in our ability to
945    recognize regulatory binding sites and to use them with knowledge of
946    specific machines to create a consistent picture of regulons in some
947    bacteria. &nbsp;This technology has been gathering adherents over
948    the
949    last five years and we believe that it will play a significant role in
950    clarifying regulons, additional proteins that will be added to specific
951    machines, and a growing understanding of states of the cell.&nbsp;</li>
952  </ol>  </ol>
953    Having said all that, is it possible to list some of the
954    important, high-payout bioinformatic questions that are worth
955  <br><br>  pondering? &nbsp;Here is a list for your consideration:<br>
956  We do not actually know what machines are present in a cell.  We are in the midst of a grand  <br>
 effort to clarify which are there and what they do.  Reaching a point where we have a near  
 complete overview of the basic inventory is arguably the highest priority at this point (we ignore  
 the medical revolution and numerous other wonderful advances, but...).  
   
 The formulation of subsystems as abstract machines  
 in which each row of the subsystem describes a specific cellular machine that is believed to be present,  
 represents a way to maintain a collection of estimates or assertions.  
 <p>  
 A <b>protein family</b> is defined to be a set of proteins that implement the same functional roles and  
 are similar over the entire lengths of the proteins.  
 <p>  
 We seek a situation in which each protein occurs in one or more subsystems and in a single protein family.  
 The computational tasks imposed by such a goal are obvious:  
 <ul>  
 <li>We need to construct databases that implement at least the following entities:  
957  <ol>  <ol>
958  <li>cells (i.e., each cell must have an ID and a set of attributes),  <li>The
959  <li>genomes,  definition of the location of genes&nbsp; for bacterial genomes
960  <li>genes,  needs cleaning up. &nbsp;The situation is made somewhat more
961  <li>proteins,  interesting by a growing use of sequencing technologies that produce
962  <li>functional roles,  systematic errors leading to numerous frameshifts and poorly called
963  <li>subsystems, and  start locations. &nbsp;Fixing these would be a problem of modest
964  <li>protein families.  difficulty and very modest reward. &nbsp;The situation in
965    eukaryotic
966    genomes is quite different. &nbsp;The problem of defining the genes
967    in
968    a eukaryotic genome is still quite unsolved. &nbsp;We conjecture
969    that</li>
970    <ul>
971    <li>the
972    key to progress is the use of sets of genomes (i.e., solve the problem
973    of defining the genes in a set of closely-related genomes first), and</li>
974    <li>the place to begin is
975    with the single-celled eukaryotic genomes. &nbsp;There are
976    many
977    types of single-celled eukaryotes, and some of them will undoubtedly
978    offer major challenges. &nbsp;However, existing experience suggests
979    that there will be numerous <span style="font-style: italic;">fungal</span>
980    genomes available (for example) and that focusing on these would be a
981    much easier task than trying to face plants, animals, etc.</li>
982    </ul>
983    <li>The
984    creation of populated subsystems is essentially a task for expert
985    biologists. &nbsp;However, the tools to support the task are a
986    reasonable focus for bioinformatic projects. &nbsp;The tools needed
987    to
988    delicately separate the roles of paralogous proteins have been
989    illustrated in the works of Jensen and Bonner, among others.
990    &nbsp;These tools relate to use of alignments, trees and motifs to
991    define the decision procedures needed to classify proteins into one of
992    several closely-related families.</li>
993    <li>The development &nbsp;of a
994    self-consistent set of protein families is a task closely related to
995    the one above. &nbsp;There are currently several major
996    efforts building such protein families. &nbsp;The
997    development
998    of protocols for maintenance of the families, studying the evolutionary
999    history of related families, development of motifs that characterize
1000    specific families, and so forth all represent parts of a large
1001    classification problem.</li>
1002    <li>There is a class of tools that attempt to spot <span style="font-style: italic;">functional coupling</span>
1003    between specific proteins. &nbsp;Some are bioinformatic (like the
1004    chromosomal clustering and fusion phenomena briefly discussed above).
1005    &nbsp;Some are essentially experimental data (e.g., protein-protein
1006    interaction data or microarray data). &nbsp;The integration of
1007    evidence
1008    into a system capable of predicting whether &nbsp;or not two
1009    specific
1010    proteins are both components of a single machine has been attempted,
1011    but much more remains to be done. &nbsp;The closely-related problem
1012    of
1013    determining whether or not two protein families are <span style="font-style: italic;">functionally coupled</span>
1014    (and precisely what that means) should be considered simultaneously.</li>
1015    <li>Defining
1016    regulons by gradually composing a consistent interpretation of
1017    subsystems, regulatory sites, and physiological data is a task that is
1018    semi-automated. &nbsp;Development of a fully automated version seems
1019    too
1020    ambitious, but developing tools to increase the productivity of
1021    biologists developing these models of transcriptional regulation is
1022    certainly going to gain much more attention.</li>
1023    <li>Development of a meaningful notion of <span style="font-style: italic;">states of a cell</span>
1024    is a problem that seems to us to have many of the characteristics one wants:
1025    &nbsp;it is a problem for which relevant data is starting to
1026    appear,
1027    many aspects of the needed infrastructure have only recently appeared,
1028    and the outcome may be of fundamental significance.</li>
1029    <li>To what
1030    extent is it possible to predict the protein families which have
1031    instances in a given cell given the closest 10 neighboring genomes and
1032    detailed information on the families they contain?</li>
1033    <li>Is it possible to think of a set of protein families as <span style="font-style: italic;">major predictors</span>
1034    that would allow you to infer the presence or absence of many other
1035    families?</li>
1036  </ol>  </ol>
1037  <li> We need to add support for developing clues to function by integrating data  <br>
1038  from sources like proximity within the genome, fusions, etc.  <ul>
 <li>We need to support a framework for the development of populated subsystems.  
 <li>We need to construct decision procedures for membership in protein families.  Some  
 of these procedures will be quite complex, although the majority of cases can be  
 handled by fairly general procedures.  
1039  </ul>  </ul>
1040    <h1> The Role of Abstraction in Setting the Stage for Software
1041    Development and Modeling</h1>
1042  <h1> The Role of Abstraction in Setting the Stage for Software Development and Modeling</h1>  In
1043    this section, we argue that the abstraction is much more than just a
1044    pedagogical aid. &nbsp;It will form the conceptual under-pinnings
1045    of
1046    the software needed to support work on the problems described in the
1047    last section (as well as numerous others that will become apparent as
1048    the revolution progresses).<br>
1049    </body></html>
