
Diff of /FigTutorial/tut_abs.html


revision 1.1, Mon Oct 1 20:12:41 2007 UTC revision 1.5, Tue Feb 12 20:25:05 2008 UTC
# Line 1  Line 1 
1  <div align=center>  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
2  <h1>The Role of Bioinformatics in Interpreting Genomes of Unicellular Organisms:</h1>  <html><head><title>Abstraction Working Document</title>
3  <br>  
4    </head>
5    <body>
6    <div align="center">
7    <h1>The Role of Bioinformatics in Interpreting Genomes of
8    Unicellular Organisms:</h1>
9  <h1>An Abstract View</h1>  <h1>An Abstract View</h1>
10  <h2>by Ross Overbeek</h2>  <h2>by Ross Overbeek, ...</h2>
11  </div>  </div>
12    <h2>Introduction</h2>
13  <h2>What Is a Cell?</h2>  This strange document began as a tutorial for computer scientists and
14    mathematicians. It was supposed
15  A <b>cell</b> is a bag (i.e., a volume enclosed by a membrane) that contains three types of things: compounds, cellular machines, and a genome.  to somehow introduce them to the computational issues in genome
16  <p>  analysis.
17  By the term <b>compound</b> I refer to the normal notion of chemical compound.  It was requested by an instructor in a computer class. Overbeek, in
18  <p>  attempting to respond to this request,
19    formulated an abstraction that he began to believe had significance
20  A <b>cellular machine</b> is a set of proteins that together perform a function.   This function is often t  beyond the tutorial.
21  transform a set of compounds into another set.  Some types of machines (transport machines)  <p>This document is a set of working notes relating to the
22  are used to move compounds into  abstract. It is not organized properly as
23  or out of the cell.  an abstraction, a tutorial, or an essay on the role of bioinformatics
24  <p>  in support of biological research. It is,
25    however, organized properly as a working document that relates to all
26  A <b>protein</b> is a string of amino acids (i.e., a string in the  20-character alphabet {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}).  of these goals.
27  <p>  </p>
28    <p>It begins with a development of the abstraction. This will be
29  A <b>genome</b> is a string of DNA bases (i.e., a string in the 4-character alphabet {A,C,G,T}).  suitable for mathematicians or computer scientists.
30  <p>  The abstraction is developed in three steps: the basic abstraction, the
31    enhanced abstraction needed to provide
32  A <b>gene</b> is a region in the genome that describes how to build a  basic bioinformatics support for biologists, and finally the third step,
33    which includes support for the notion
34    of regulation. The intent throughout this discussion will be to seek a
35    minimal set of concepts needed to
36    effectively capture the essence of the required data. Unlike almost all
37    efforts to lay a foundation
38    for tutorials, software or research in biology, this effort focuses on
39    leaving out as much as possible.
40    While we do believe that there is an almost unlimited complexity that
41    can be introduced, and almost all of
42    it is needed for some specific goals, the vast majority of tools and
43    discussions require (we believe) relatively few
44    concepts. As they say, "the proof is in the pudding."
45    </p>
46    <p>The second section will feature a bit more tutorial comments.
47    It may well repeat much of what is in Part 1.
48    This part is offered as a way of easing a computer scientist or
49    mathematician into the issues that need to be
50    considered, if they wish to try to do useful research relating to the
51    genomics revolution. Eventually, this part
52    will be dramatically expanded by giving condensed summaries of the
53    machines of the cell broken into two broad
54    sets: the metabolic network and the cellular machinery not directly
55    included in the metabolic network. Loosely,
56    this separates what would be learned in a microbial biochemistry class
57    (when they exist) from what would
58    be learned in a course on molecular biology.
59    </p>
60    <p>The third part is an essay that attempts to characterize our
61    view on </p>
62    <ul>
63    <li> what the main goals should be in current efforts to
64    advance biological knowledge via genome research,
65    </li>
66    <li> what role bioinformatics researchers have played in the
67    past, and
68    </li>
69    <li> what role they could productively play during the coming
70    few years.
71    </li>
72    </ul>
73    As such, it is undoubtedly an arrogant formulation by a group of
74    individuals with minimal background in
75    biology.
76    <p>The fourth section will focus on the implications of the
77    abstractions in software development.
78    This is a bit of a radical proposal that makes sense to us (and is in
79    an area that we can
80    legitimately claim expertise).
81    </p>
82    <h1>Part 1: The Abstractions</h1>
83    <h2>The cell: a Minimal Perspective</h2>
84    A <b>cell</b> is a bag (i.e., a volume enclosed by a
85    membrane) that contains three types of things: compounds, cellular
86    machines, and a genome.
87    <p>By the term <b>compound</b> we refer to the
88    normal notion of chemical compound. </p>
89    <p>A <b>cellular machine</b> is a set of proteins
90    that together perform a function. Unless otherwise noted,
91    when we use the term <i>machine</i> we will always be
92    speaking of a cellular machine.
93    Many machines
94    transform one set of compounds into another set. Some machines
95    (transport machines) are used to move compounds into
96    or out of the cell. Later we will try to convey a more comprehensive
97    notion of what functions are implemented
98    by machines that we understand.
99    </p>
100    <p>A <b>protein</b> is a string of amino acids
101    (i.e., a string in the 20-character alphabet
102    {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}).
103    </p>
104    <p>A <b>genome</b> is a string of DNA bases (i.e., a
105    string in the 4-character alphabet {A,C,G,T}).
106    </p>
107    <p>A <b>gene</b> is a region in the genome that
108    describes how to build a
109  protein.  The description is a sequence of 3-character codons.  Each  protein.  The description is a sequence of 3-character codons.  Each
110  codon corresponds to either a single amino acid or a stop codon.  codon corresponds to either a single amino acid or a stop codon.
111  There are three stop codons: {TAA,TAG,TGA}.  The genetic code is the  There are three stop codons: {TAA,TAG,TGA}.  The genetic code is the
112  table of correspondences between codons and amino acids:  table of correspondences between codons and amino acids:
113  <br><br>  <br>
114  <table border>  <br>
115  <tr><th>Amino Acid</th><th>Codons</th></tr>  <table border="1">
116  <tr><td>A</td> <td>GCT, GCC, GCA, GCG </td></tr>  <tbody>
117  <tr><td>C</td> <td>TGT, TGC</td></tr>  <tr>
118  <tr><td>D</td> <td>GAT, GAC</td></tr>  <th>Amino Acid</th>
119  <tr><td>E</td> <td>GAA, GAG</td></tr>  <th>Codons</th>
120  <tr><td>F</td> <td>TTT, TTC</td></tr>  </tr>
121  <tr><td>G</td> <td>GGT, GGC, GGA, GGG</td></tr>  <tr>
122  <tr><td>H</td> <td>CAT, CAC</td></tr>  <td>A</td>
123  <tr><td>I</td> <td>ATT, ATC, ATA</td></tr>  <td>GCT, GCC, GCA, GCG </td>
124  <tr><td>K</td> <td>AAA, AAG</td></tr>  </tr>
125  <tr><td>L</td> <td>TTA, TTG, CTT, CTC, CTA, CTG</td></tr>  <tr>
126  <tr><td>M</td> <td>ATG</td></tr>  <td>C</td>
127  <tr><td>N</td> <td>AAT, AAC</td></tr>  <td>TGT, TGC</td>
128  <tr><td>P</td> <td>CCT, CCC, CCA, CCG</td></tr>  </tr>
129  <tr><td>Q</td> <td>CAA, CAG</td></tr>  <tr>
130  <tr><td>R</td> <td>CGT, CGC, CGA, CGG, AGA, AGG</td></tr>  <td>D</td>
131  <tr><td>S</td> <td>TCT, TCC, TCA, TCG, AGT, AGC</td></tr>  <td>GAT, GAC</td>
132  <tr><td>T</td> <td>ACT, ACC, ACA, ACG</td></tr>  </tr>
133  <tr><td>V</td> <td>GTT, GTC, GTA, GTG</td></tr>  <tr>
134  <tr><td>W</td> <td>TGG</td></tr>  <td>E</td>
135  <tr><td>Y</td> <td>TAT, TAC</td></tr>  <td>GAA, GAG</td>
136  <tr><td>*</td> <td>TAG, TGA, TAA  [Stop codons]</td></tr>  </tr>
137    <tr>
138    <td>F</td>
139    <td>TTT, TTC</td>
140    </tr>
141    <tr>
142    <td>G</td>
143    <td>GGT, GGC, GGA, GGG</td>
144    </tr>
145    <tr>
146    <td>H</td>
147    <td>CAT, CAC</td>
148    </tr>
149    <tr>
150    <td>I</td>
151    <td>ATT, ATC, ATA</td>
152    </tr>
153    <tr>
154    <td>K</td>
155    <td>AAA, AAG</td>
156    </tr>
157    <tr>
158    <td>L</td>
159    <td>TTA, TTG, CTT, CTC, CTA, CTG</td>
160    </tr>
161    <tr>
162    <td>M</td>
163    <td>ATG</td>
164    </tr>
165    <tr>
166    <td>N</td>
167    <td>AAT, AAC</td>
168    </tr>
169    <tr>
170    <td>P</td>
171    <td>CCT, CCC, CCA, CCG</td>
172    </tr>
173    <tr>
174    <td>Q</td>
175    <td>CAA, CAG</td>
176    </tr>
177    <tr>
178    <td>R</td>
179    <td>CGT, CGC, CGA, CGG, AGA, AGG</td>
180    </tr>
181    <tr>
182    <td>S</td>
183    <td>TCT, TCC, TCA, TCG, AGT, AGC</td>
184    </tr>
185    <tr>
186    <td>T</td>
187    <td>ACT, ACC, ACA, ACG</td>
188    </tr>
189    <tr>
190    <td>V</td>
191    <td>GTT, GTC, GTA, GTG</td>
192    </tr>
193    <tr>
194    <td>W</td>
195    <td>TGG</td>
196    </tr>
197    <tr>
198    <td>Y</td>
199    <td>TAT, TAC</td>
200    </tr>
201    <tr>
202    <td>*</td>
203    <td>TAG, TGA, TAA [Stop codons]</td>
204    </tr>
205    </tbody>
206  </table>  </table>
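The codon table above is enough to express a gene mechanically. The following is a minimal sketch, assuming the gene is given as a string over {A,C,G,T} that starts at the first codon; the function name `translate` and the dictionary names are ours, not part of the original document.

```python
# Genetic code as listed in the table above: amino acid -> codons.
# '*' marks the three stop codons.
CODONS_BY_AMINO_ACID = {
    "A": "GCT GCC GCA GCG", "C": "TGT TGC", "D": "GAT GAC", "E": "GAA GAG",
    "F": "TTT TTC", "G": "GGT GGC GGA GGG", "H": "CAT CAC",
    "I": "ATT ATC ATA", "K": "AAA AAG", "L": "TTA TTG CTT CTC CTA CTG",
    "M": "ATG", "N": "AAT AAC", "P": "CCT CCC CCA CCG", "Q": "CAA CAG",
    "R": "CGT CGC CGA CGG AGA AGG", "S": "TCT TCC TCA TCG AGT AGC",
    "T": "ACT ACC ACA ACG", "V": "GTT GTC GTA GTG", "W": "TGG",
    "Y": "TAT TAC", "*": "TAG TGA TAA",
}

# Invert the table so that each codon maps to its amino acid (or '*').
GENETIC_CODE = {
    codon: amino_acid
    for amino_acid, codons in CODONS_BY_AMINO_ACID.items()
    for codon in codons.split()
}


def translate(gene):
    """Translate a gene codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(gene) - 2, 3):
        amino_acid = GENETIC_CODE[gene[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

Applied to the first five codons of the sequence called gene1 in this document (ATG GCT GAT TTA TTC), this sketch produces the first five characters of the corresponding protein.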
 <br><br>  
 <hr>  
 This minimal notion of a cell is enough to explain some of the central  
 problems in bioinformatics:  
   
 <h3>Identify the genes within a genome</h3>  
   
 This problem simply involves taking a genome (a string of DNA) and locating  
 the set of genes it contains.  Does the existence of hundreds of genomes (genomes
 with at least some estimate of where the genes occur) affect how you might do this?
   
 <h3>Given two proteins, "align" them in a way that minimizes some edit function.</h3>
   
 For example:  
207  <br>  <br>
208  <br>  <br>
209  <pre>  </p>
210    <hr>The process of building a protein as a string of amino acids
211  seq1            -------------------MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLT  from the gene containing codons is
212  seq2            MKIRVICGAPTPKPFIKIPMKYYSTNKQAPLASLEEAVVKGLASDKGLFMPMTIKPLPQE  called <b>expressing</b> the gene.
213                                     ** *. :.:   .*: :**.:**..::***:*  :  :.  <br>
214    A <b>subsystem</b> (i.e., an abstract cellular machine) is
215  seq1            EIDEMLKLDFVTRSAKILSAFIGDEIPQEILEERVRAAFAFPAPVANVESDVGCLELFHG  a set of functional roles.
216  seq2            FYDEIENLSFREIAYRVADAFFGEDVPAETLKEIVYDTLNFDVPLVPVKENIYSLELFHG  Each protein implements one or more functional roles. The set of
217                    **: :*.*   : :: .**:*:::* * *:* *  :: * .*:. *:.:: .******  functional roles
218    implemented by the protein is called the <b>function of the
219  seq1            PTLAFKDFGGRFMAQMLTHIA---GDKPVTILTATSGDTGAAVAHAFYGLPNVKVVILYP  protein</b>. The function of a multifunctional
220  seq2            PTLAFKDVGGRFMARLLGYFIRKEGRKQVNVLVATSGDTGSAVANGFLGVEGIHVYVLYP  protein that implements {functional-role-1,functional-role-2} is
221                  *******.******::* ::    * * *.:*.*******:***:.* *: .::* :***  normally written as
222    <i>functional-role-1 / functional-role-2</i>.
223  seq1            RGKISPLQEKLFCTLGGNIETVAIDGDFDACQALVKQAFDDEELKVALGLNSANSINISR  <br>
224  seq2            KGKVSEIQEKQFTTLGRNITALEVDGTFDDCQALVKAAFMDQELNEQLLLTSANSINVAR  <br>
225                  :**:* :*** * *** ** :: :** ** ****** ** *:**:  * *.******::*  A <b>populated subsystem</b> is a subsystem with an
226    attached spreadsheet. Each column
227  seq1            LLAQICYYFEAVAQLPQETRNQ-LVVSVPSGNFGDLTAGLLAKSLGLPVKRFIAATNVND  in the spreadsheet corresponds to a functional role in the subsystem,
228  seq2            FLPQAFYYFYAYAQLKKAGRAENVVICVPSGNFGNITAGLFGKKMGLPVRRFIAANNKND  and each row corresponds to
229                  :*.*  *** * *** :  * : :*:.*******::****:.*.:****:*****.* **  a specific genome. Each cell in the spreadsheet contains the genes from
230  </pre>  the corresponding genome
231    that implement the designated functional role (there may be 0 or more
232  shows an alignment of two proteins (called <i>seq1</i> and <i>seq2</i>).  such genes).
233    <br>
234  <h3> Given a set of sequences, align them in a way that minimizes some edit function.</h3>  <br>
235    We do not actually know what machines are present in a cell. We are in
236    the midst of a grand
237    effort to clarify which are there and what they do. The formulation of
238    subsystems as abstract machines
239    in which each row of the subsystem describes a specific cellular
240    machine that is believed to be present,
241    represents a way to maintain a collection of estimates or assertions.
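One way to picture a populated subsystem is as a small data structure. The sketch below assumes a nested mapping from genome to functional role to a list of genes; the class name `PopulatedSubsystem`, its method names, and the sample identifiers in the test are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PopulatedSubsystem:
    """A subsystem (a set of functional roles) with its attached spreadsheet."""
    name: str
    roles: list                               # columns: functional roles
    rows: dict = field(default_factory=dict)  # genome -> {role: [genes]}

    def add_gene(self, genome, role, gene):
        """Record that a gene in the given genome implements the role."""
        if role not in self.roles:
            raise ValueError(f"{role!r} is not a role of {self.name!r}")
        self.rows.setdefault(genome, {}).setdefault(role, []).append(gene)

    def cell(self, genome, role):
        """Genes in one spreadsheet cell; there may be zero or more."""
        return list(self.rows.get(genome, {}).get(role, []))

    def missing_roles(self, genome):
        """Roles for which this genome's row currently lists no gene."""
        return [role for role in self.roles if not self.cell(genome, role)]
```

A row with no missing roles is one reading of the assertion that the machine is present in that genome; that reading is our assumption rather than something the text prescribes.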
242    <p>A <b>protein family</b> is defined to be a set of
243    proteins that implement the same functional roles and
244    are similar over the entire lengths of the proteins.
245    </p>
246    <p>We seek a situation in which each protein occurs in one or
247    more subsystems and in a single protein family.
248    </p>
249    <p>In any specific cell, sets of specific cellular machines are
250    switched on and off as units. That is, they are <i>co-regulated</i>.
251    We will call such a set
252    of <i>co-regulated cellular machines</i> a <b>regulon</b>
253    (note that a regulon is often a set containing
254    a single cellular machine). A <b>state</b> of a cell will
255    be defined
256    as the set of regulons that are operational at a point in time. Thus, a
257    state amounts to the set
258    of cellular machines that are operational at one instant.
259    </p>
260    <p>Microarrays are, for a given genome, two lists of genes that
261    "changed expression levels" between two states of a
262    cell. Basically, the first list contains genes that were "active" during
263    the first state, but not the second; and the
264    second list contains genes that were "active" in the second but not the
265    first. If a cellular
266    machine utilizes protein <i>X</i>, and <i>X</i>
267    is in the first list, and if <i>X</i> is used in
268    only one cellular machine, then it would be reasonable to infer that
269    the machine was
270    active in the first state, but not the second.
271    </p>
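The inference described in the preceding paragraph can be sketched in a few lines. This assumes each machine is given as a set of protein identifiers and that the first microarray list has already been mapped to those identifiers; the function name and the example names in the test are hypothetical.

```python
def machines_active_only_in_first(machines, first_list):
    """Infer machines that were active in the first state but not the second.

    machines:   dict mapping machine name -> set of proteins it utilizes
    first_list: proteins "active" in the first state but not the second

    A protein supports the inference only if exactly one machine uses it.
    """
    # Index: protein -> set of machines that utilize it.
    users = {}
    for machine, proteins in machines.items():
        for protein in proteins:
            users.setdefault(protein, set()).add(machine)

    inferred = set()
    for protein in first_list:
        using = users.get(protein, set())
        if len(using) == 1:
            inferred |= using
    return inferred
```

Proteins shared by several machines are ignored here, since the text only justifies the inference when a protein is used in a single machine.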
272    <h2>The cell: the Enhanced Formulation Needed to Support
273    Bioinformatics</h2>
274    In the enhanced abstraction, we need to loosen up some concepts. In
275    particular,
276    <ul>
277    <li> A <b>genome</b> is a set of strings in a
278    4-character alphabet. Each of the strings
279    is called a <b>contig</b>. Note that the concept as
280    formulated covers both incomplete genomes and genomes with multiple
281    replicons.
282    </li>
283    <li>The genes within a genome are of two distinct types:
284    <ol>
285    <li>those that describe how to construct a protein (i.e.,
286    protein-encoding genes), and
287    </li>
288    <li>those that describe how to construct a string of RNA
289    (i.e., how to construct a string in the
290    4-character RNA alphabet {A,C,G,U}).
291    </li>
292    </ol>
293    <br>
294    <br>
295    </li>
296    <li>The location of a gene is generalized to be a set of
297    regions within the genome (that are
298    concatenated to form the instructions needed to construct either a
299    protein or a string of RNA).
300    </li>
301    <li>A protein is a string over an alphabet that now includes
302    the 20 character codes from
303    the basic abstraction plus a very limited set of extra codes. We
304    already have cases in which <i>selenocysteine</i> and <i>pyrrolysine</i>
305    appear as nonstandard translations of codons, and there may eventually
306    be more.
307    </li>
308    <li>Each protein-encoding gene has both a DNA sequence (by
309    definition) and a translation. However,
310    the translation is not required to exactly match what a codon-by-codon
311    translation of the DNA sequence
312    would produce. This allows us to handle the very rare instances in
313    which selenocysteine occurs as the translation
314    of TGA or pyrrolysine occurs as a translation of TAG (and others, if
315    necessary).
316    </li>
317    </ul>
318    This loosened-up formulation represents a very minimal set of changes.
319    They should be left out of the
320    basic tutorial for computer scientists and mathematicians.
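The enhanced notions of genome and gene location can be sketched as follows. We assume a genome is a mapping from contig name to DNA string and a location is a list of (contig, start, end) regions using 0-based, end-exclusive coordinates; strand is ignored for brevity. All names, coordinates, and sequences are hypothetical.

```python
def gene_dna(genome, location):
    """Concatenate the regions of a gene, in order, into one DNA string.

    genome:   dict mapping contig name -> DNA string
    location: list of (contig, start, end) regions, 0-based, end-exclusive
    """
    return "".join(genome[contig][start:end] for contig, start, end in location)


# Hypothetical two-contig genome and a gene spread over two regions.
example_genome = {"contig1": "CCATGGCCTT", "contig2": "GGTGATAAGG"}
example_location = [("contig1", 2, 8), ("contig2", 2, 8)]

# A protein-encoding gene keeps its translation as stored data rather than
# deriving it from the DNA, so a nonstandard reading of a codon can be kept.
example_gene = {
    "location": example_location,
    "translation": "MAU",  # hypothetical: TGA stored as selenocysteine (U)
}
```

Here the concatenated DNA is ATG GCC TGA TAA. A strict codon-by-codon translation would stop at TGA, while the stored translation records it as an extra character, which is the kind of mismatch the enhanced abstraction allows.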
321    <h2>The cell: Adding the Concepts Needed to Discuss
322    Transcriptional Regulation</h2>
323    In the final version of the abstraction, we add the minimal set of
324    notions needed to support
325    analysis of transcriptional regulation. An <b>operon</b>
326    is a set of contiguous genes that are all on the same strand and are
327    all co-regulated. We consider a gene that is not co-regulated with any
328    adjacent genes
329    to be an operon composed of just itself. A <b>binding site</b>
330    is a small region of DNA (normally
331    occurring a short space ahead of an operon) that acts as a switch
332    turning the operon "on" or "off". When
333    a specific protein or expressed RNA called a <b>transcriptional
334    regulator</b> binds the site, it flips the switch. One or more
335    specific transcriptional regulators can bind a specific site (i.e.,
336    sets of sites are associated with each specific transcriptional
337    regulator). A regulator binding at a given site
338    always has the same effect (either activating or deactivating the
339    operon), but which effect depends on
340    the site-regulator pair.
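The regulation concepts above can also be written down as data. The sketch below assumes each operon lists its binding sites and that the effect of binding is stored per site-regulator pair; the operon, site, and regulator names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operon:
    name: str
    genes: tuple          # contiguous, co-regulated genes on the same strand
    binding_sites: tuple  # names of the sites that switch this operon


# Hypothetical example data.
OPERONS = [
    Operon("opA", ("gene1", "gene2"), ("site1",)),
    Operon("opB", ("gene3",), ("site2",)),  # a single-gene operon
]

# The effect is fixed for each (site, regulator) pair.
EFFECTS = {
    ("site1", "regX"): "activate",
    ("site2", "regX"): "deactivate",
    ("site2", "regY"): "activate",
}


def effects_on_operon(operon, bound_regulators, effects):
    """List (site, regulator, effect) for bound regulators acting on an operon."""
    return sorted(
        (site, regulator, effect)
        for (site, regulator), effect in effects.items()
        if site in operon.binding_sites and regulator in bound_regulators
    )
```

How competing effects at one site combine is not specified in the text, so this sketch only reports them and does not resolve them.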
341    <h1>Part 2: Tutorial Notes</h1>
342    <h2>Notes for The Basic Abstraction</h2>
343    We will be speaking about organisms that are a single cell. At some
344    point life began on earth.
345    The single-celled organisms that we know of replicate producing copies
346    of themselves that have
347    genomes which usually have very, very similar content to that of the
348    parent cell. <b>Evolution</b> is the
349    process in which cells replicate with some alterations in their
350    genomes, are subjected to
351    <i>selective pressure</i>, and survive or not depending on
352    many somewhat random factors. The makeup of
353    cells (i.e., the genomes they contain and the machines that define what
354    they are capable of doing)
355    changes gradually (and sometimes not so gradually) as time passes.
356    <p>The original life forms that existed billions of years ago
357    have evolved into three broad categories of
358    life forms. That is, the evolutionary process led to early divisions,
359    and these led to three main
360    categories of single-celled organisms. We call these three forms the <b>archaea</b>,
361    the <b>bacteria</b>, and the <b>eukaryotes</b>.
362    A majority of the organisms for which we have acquired complete genomes
363    are from the bacteria, although the
364    numbers are rapidly growing for all three domains.
365    </p>
366    <p>The minimal notion of a cell is enough to explain some of the
367    basic
368    problems in bioinformatics:
369    </p>
370    <h3>Identify the genes within a genome</h3>
371    If we are to understand the contents of genomes, we will need to
372    locate the genes that occur in each genome. This problem simply
373    involves taking a genome (a
374    string of DNA) and locating the set of genes it contains. In the case
375    of bacteria and archaea, we know pretty well how to
376    locate the genes. Once we
377    have identified instances from many genomes, it becomes possible to
378    recognize the genes in a new genome by just looking for things similar
379    to those we already understand. The following problem is at the heart
380    of recognizing when two
381    genes are "similar".
382    <h3>Given two genes, "align" them in a way that minimizes some
383    edit function. </h3>
384    For example, here is what you see when you align two genes from
385    distinct organisms:
386    <pre>gene1 ATGGCTGATTTATTCGCATTGACCGAAGAAGCGTTGGCGGGCATGGGCATCGAGTTGGTC<br>gene2 ---GTGCAACTGACGGAACTGATAGAAACTACGGTCACGGGGCTCGGCTACGAGCTCGTC<br>* * * * * *** *** ** * **** * *** **** * ***<br>gene1 GATGTCGAACGTGCCGCCTTAGGCTTGTTGCGCGTGACCATAGACCGTGAGGACGGTGTT<br>gene2 GATCTCGAGCGCACCGGGCGCGGCATGGTCTGCGTCTACATCGATCAGCCCGCCGGCATC<br>*** **** ** *** *** ** * **** *** ** * * *** * gene1 CGCATCGAAGATTGTGAGCAGGTGTCCCGGCAATTGTCGCGCGTCTACGAGGTCGAGAAC<br>gene2 ACGATCGACGATTGCGAGAAGGTCACGCGTCAGCTCCAGCACGTACTGACGGTCGAAAAC<br>***** ***** *** **** * ** ** * ** *** ****** ***<br>gene1 ATCGATTACAAACGTCTGGAAGTTGGCTCGCCGGGCGTGGATCGCCCCTTGCGCAACGAG<br>gene2 ATCGATTACGAACGGCTCGAGGTCTCGTCACCGGGGCTCGACCGGCCGTTGAAGAAGCTG<br>********* **** ** ** ** ** ***** * ** ** ** *** ** *<br>gene1 GCGGAATTCCGTCGTTTCGCGGGTGAACGTATCGAGATCAAGCTGCGTGAGGCAGTCGAC<br>gene2 GCTGACTTCACGCGTTTCGCGGGCAGCGAGGCCGTCATCACCCTGAAAAAGCCGTTGGAC<br>** ** *** *********** ** **** *** ** * * ***<br>gene1 GGGCGCAAAGTGTTTACCGGCATCCTGCAAGAGGCGGACACGTCTGCTGACGATAAGACG<br>gene2 GGGCGCAAGACGTACCGGGGCATTCTGCACGCGCCGAAC------------GGCGAGACG<br>******** ** ***** ***** * * ** ** * *****<br>gene1 GTGTTCGGTCTCGAATTTGAGGCAAAGAAGGACGATATTCAGGTACTGAGCTTCACGCTC<br>gene2 AT---CGGTTTGGAATTTGAGAGGAAGAAGGGCGAGGCGGCCATGCTGGATTTCACGCTG<br>* **** * ********* ******* *** * *** ******** gene1 GATGACATCGAGCGCGCCAAGCTGGATCCCGTTCTGGATTTCAAGGGCAAAAAGCGATGA<br>gene2 GCGGACATCGACAAGGCCCGCCTGATTCCGCACGTTGACTTTAGGAGCCGCAAACAATGA<br>* ******** *** *** *** * ** ** * * ** ** * ****<br></pre>
387    <hr>
388    The sequences are recognizably similar, and in fact implement exactly
389    the same function
390    in the two cells. If we align the protein sequences corresponding to
391    these two
392    genes, we get
393    <pre>gene1 MADLFALTEEALAGMGIELVDVERAALGLLRVTIDREDGVRIEDCEQVSRQLSRVYEVEN<br>gene2 -VQLTELIETTVTGLGYELVDLERTGRGMVCVYIDQPAGITIDDCEKVTRQLQHVLTVEN<br> :* * * :::*:* ****:**:. *:: * **: *: *:***:*:***.:* ***<br><br>gene1 IDYKRLEVGSPGVDRPLRNEAEFRRFAGERIEIKLREAVDGRKVFTGILQEADTSADDKT<br>gene2 IDYERLEVSSPGLDRPLKKLADFTRFAGSEAVITLKKPLDGRKTYRGILHAPNG-----E<br> ***:****.***:****:: *:* ****.. *.*::.:****.: ***: .:<br><br>gene1 VFGLEFEAKKDDIQVLSFTLDDIERAKLDPVLDFKGKKR<br>gene2 TIGLEFERKKGEAAMLDFTLADIDKARLIPHVDFRSRKQ<br> .:***** **.: :*.*** **::*:* * :**:.:*:<br></pre>
394    There is a great deal of work relating to recognizing when two
395    sequences are
396    similar and whether or not they had a common ancestor. Understanding
397    why
398    selective pressure conserves sections of sequences, but not others,
399    will yield
400    important clues. Can you reason out why some sections might be
401    conserved, while
402    others vary wildly?
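To make "minimizes some edit function" concrete, here is a minimal sketch that assumes the simplest possible edit function: every insertion, deletion, and substitution costs 1. The alignments shown in this document were produced with more refined scoring, so this illustrates the idea and is not the method used for them.

```python
def edit_distance(a, b):
    """Minimum number of unit-cost insertions, deletions, and substitutions
    needed to turn string a into string b (classic dynamic programming)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                      # delete char_a
                curr[j - 1] + 1,                  # insert char_b
                prev[j - 1] + (char_a != char_b)  # match or substitute
            ))
        prev = curr
    return prev[-1]
```

The same table, with a traceback added, yields the alignment itself; replacing the unit costs with a substitution matrix and gap penalties gives the scoring schemes used in practice.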
403    <p>Comparing sets of sequences that have retained the same
404    function is
405    at the heart of understanding cellular machines and the proteins that
406    implement them. We find that looking at sets (often with more than two
407    sequences) and aligning them
408    is important.
409    </p>
410    <h3> Given a set of sequences, align them in a way that minimizes
411    some edit function.</h3>
412  Here is an example of a multiple sequence alignment:  Here is an example of a multiple sequence alignment:
413  <br>  <br>
414  <br>  <br>
415  <pre>  <pre>CLUSTAL W (1.83) multiple sequence alignment<br><br><br>seq3 -------------------MRYISTRGQAPALNFEDVLLAGLASDGGLYVPENLPRFTLE<br>seq4 -------------------MRYISTRGSAPTLSFEEVLLTGLASDGGLYVPESLPSFTSA<br>seq5 -------------------MNYISTRGAIAPIGFKDAVMMGLATDGGLLLPETIPALGRN<br>seq1 -------------------MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLT<br>seq2 MKIRVICGAPTPKPFIKIPMKYYSTNKQAPLASLEEAVVKGLASDKGLFMPMTIKPLPQE<br> *. . . .: :.: **..: ** .* : :<br><br>seq3 EIASWVGLPYHELAFRVMRPFVAGSIADADFKKILEETYGVFAHDAVAPLRQLNGNEWVL<br>seq4 ELEAMASLDYPSLAHRILLPFVEEAFTGEELREIIDDTYAVFRHSAVAPLVQLDHNQWVL<br>seq5 TLESWQSLSYQDLAFNVIS-LFADDIPAQDLKDLIDRSYATFSHPEITPVVEKDG-VYIL<br>seq1 EIDEMLKLDFVTRSAKILSAFIGDEIPQEILEERVRAAFAFP-----APVANVESDVGCL<br>seq2 FYDEIENLSFREIAYRVADAFFGEDVPAETLKEIVYDTLNFD-----VPLVPVKENIYSL<br> * : : .: :. .. :.. : : .*: . *<br><br>seq3 ELFHGPTLAFKDFALQLLGRLLDHVLAKRGER-VVIMGATSGDTGSAAIEGCRRCDNVDI<br>seq4 ELFHGPTLAFKDFALQLLGRLLDAILKRRGEK-VVIMGATSGDTGSAAIAGCERCENIDI<br>seq5 ELFHGPTLAFKDVALQLLGNLFEYLLKERGEK-MNIVGATSGDTGSAAIYGVRGKDKINI<br>seq1 ELFHGPTLAFKDFGGRFMAQMLTHIA---GDKPVTILTATSGDTGAAVAHAFYGLPNVKV<br>seq2 ELFHGPTLAFKDVGGRFMARLLGYFIRKEGRKQVNVLVATSGDTGSAVANGFLGVEGIHV<br> ************.. :::..:: . * : : :: *******:*. . 
:.:<br><br>seq3 FIMHPHNRVSEVQRRQMTTILGDNIHNIAIEGNFDDCQEMVKASFADQGFLK-GTRLVAV<br>seq4 FILHPHGRVSEVQRRQMTTLSAPTIHNLAIEGNFDDCQAMVKASFRDQSFLPDGRRLVAV<br>seq5 FILHPHGKTSPVQALQMTTVLDPNVHNIAARGTFDDCQNIVKSLFSDLPFKE-KYSLGAV<br>seq1 VILYPRGKISPLQEKLFCTLGG-NIETVAIDGDFDACQALVKQAFDDEELKV-ALGLNSA<br>seq2 YVLYPKGKVSEIQEKQFTTLGR-NITALEVDGTFDDCQALVKAAFMDQELNE-QLLLTSA<br> :::*:.: * :* : *: .: : * ** ** :** * * : * :.<br><br>seq3 NSINWARIMAQIVYYFHAALQLG-APH-RSVAFSVPTGNFGDIFAGYLARNMGLPVSQLI<br>seq4 NSINWARIMAQIVYYFYAGLRLG-APH-RAAAYSVPTGNFGDIFAGYLASKMGLPVAQLM<br>seq5 NSINWARVLAQVVYYFYAYFRVA-ALFGQEVVFSVPTGNFGDIFAGYVAKRMGLPIRRLI<br>seq1 NSINISRLLAQICYYFEAVAQLPQETRNQ-LVVSVPSGNFGDLTAGLLAKSLGLPVKRFI<br>seq2 NSINVARFLPQAFYYFYAYAQLKKAGRAENVVICVPSGNFGNITAGLFGKKMGLPVRRFI<br> **** :*.:.* *** * :: . . .**:****:: ** .. :***: :::<br><br>seq3 VATNRNDILHRFMSGNRYDKDTLHPSLSPSMDIMVSSNFERLLFDLHGRNGKAVAELLDA<br>seq4 IATNRNDVLHRLLSTGDYARQTLEHTLSPSMDISVSSNFERLMFDLYERDGAAIASLMAA<br>seq5 LATNENNILSRFINGGDYSLGDVVATVSPSMDIQLASNFERYVYYLFGENPARVREAFAA<br>seq1 AATNVNDTVPRFLHDGQWSPKATQATLSNAMDVSQPNNWPR-VEELFR------------<br>seq2 AANNKNDIFYQYLQTGQYNPRPSVATIANAMDVGDPSNFAR-VLDLYGGS----------<br> *.* *: . : : . : ::: :**: ..*: * : *.<br><br>seq3 FKASGKLSVEDQRWTEARKLFDSLAVSDEQTCETIAEVYRSCGELLDPHTAIGVRAAREC<br>seq4 FDD-GDITLSDAAMEKARQLFASHRVDDAQTLACIADVWGRTEYLLDPHSAIGYAAATQP<br>seq5 LPTKGRIDFTEAEMEKVRDEFLSRSVNEDETIATIAAFHRETGYILDPHTAVGVKAALEL<br>seq1 -------------RKIWQLKELGYAAVDDETTQQTMRELKELGYTSEPHAAVAYRALRDQ<br>seq2 -------------HAAIAAEISGTTYTDEQIRESVKACWQQTGYLLDPHGACGYRALEEG<br> . : : :** * . 
* :<br><br>seq3 RRSLSVPMVTLGTAHPVKFPEAVEKAGIGQAPALPAHLADLFEREERCTVLPNELAKVQA<br>seq4 GANTQTPWVTLATAHPAKFPDAIKASAVGTTAQLPVHLADLFERSEHFDVLPNDIAAVQR<br>seq5 VQDG-TPAVCLATAHPAKFAEAVVR-AVGFEPSRPTSLEGIEALPSRCDVLDADRDAIKA<br>seq1 LNPG-EYGLFLGTAHPAKFKESVEA-ILGETLDLPKELAERADLPLLSHNLPADFAALRK<br>seq2 LQPG-ETGVFLETAHPAKFLQTVES-IIGTEVEIPAKLRAFMKGEKKSLPMTKEFADFKS<br> : * ****.** ::: :* * * : : .:<br><br>seq3 FVSQHGNRGKPL<br>seq4 FMSGHLGA----<br>seq5 FIEKKAL-----<br>seq1 LMMNHQ------<br>seq2 YLLGK-------<br> : :<br></pre>
416  CLUSTAL W (1.83) multiple sequence alignment  <h3> Given a multiple sequence alignment, determine the most
417    likely evolutionary history of the sequences (i.e., construct a
418    phylogenetic tree).</h3>
419  seq3            -------------------MRYISTRGQAPALNFEDVLLAGLASDGGLYVPENLPRFTLE  From the extant five sequences that are similar and displayed in the
420  seq4            -------------------MRYISTRGSAPTLSFEEVLLTGLASDGGLYVPESLPSFTSA  previous alignment, we can construct
421  seq5            -------------------MNYISTRGAIAPIGFKDAVMMGLATDGGLLLPETIPALGRN  a tree that depicts the "phylogenetic history" of the sequences.
422  seq1            -------------------MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLT  Here is one reasonable tree for the last 5 sequences.
 seq2            MKIRVICGAPTPKPFIKIPMKYYSTNKQAPLASLEEAVVKGLASDKGLFMPMTIKPLPQE  
                                    *.  . .      .: :.:  **..: ** .*  :  :  
   
 seq3            EIASWVGLPYHELAFRVMRPFVAGSIADADFKKILEETYGVFAHDAVAPLRQLNGNEWVL  
 seq4            ELEAMASLDYPSLAHRILLPFVEEAFTGEELREIIDDTYAVFRHSAVAPLVQLDHNQWVL  
 seq5            TLESWQSLSYQDLAFNVIS-LFADDIPAQDLKDLIDRSYATFSHPEITPVVEKDG-VYIL  
 seq1            EIDEMLKLDFVTRSAKILSAFIGDEIPQEILEERVRAAFAFP-----APVANVESDVGCL  
 seq2            FYDEIENLSFREIAYRVADAFFGEDVPAETLKEIVYDTLNFD-----VPLVPVKENIYSL  
                        * :   : .:   :.   ..   :.. :  :         .*:   .     *  
   
 seq3            ELFHGPTLAFKDFALQLLGRLLDHVLAKRGER-VVIMGATSGDTGSAAIEGCRRCDNVDI  
 seq4            ELFHGPTLAFKDFALQLLGRLLDAILKRRGEK-VVIMGATSGDTGSAAIAGCERCENIDI  
 seq5            ELFHGPTLAFKDVALQLLGNLFEYLLKERGEK-MNIVGATSGDTGSAAIYGVRGKDKINI  
 seq1            ELFHGPTLAFKDFGGRFMAQMLTHIA---GDKPVTILTATSGDTGAAVAHAFYGLPNVKV  
 seq2            ELFHGPTLAFKDVGGRFMARLLGYFIRKEGRKQVNVLVATSGDTGSAVANGFLGVEGIHV  
                 ************.. :::..::  .    * : : :: *******:*.  .      :.:  
   
 seq3            FIMHPHNRVSEVQRRQMTTILGDNIHNIAIEGNFDDCQEMVKASFADQGFLK-GTRLVAV  
 seq4            FILHPHGRVSEVQRRQMTTLSAPTIHNLAIEGNFDDCQAMVKASFRDQSFLPDGRRLVAV  
 seq5            FILHPHGKTSPVQALQMTTVLDPNVHNIAARGTFDDCQNIVKSLFSDLPFKE-KYSLGAV  
 seq1            VILYPRGKISPLQEKLFCTLGG-NIETVAIDGDFDACQALVKQAFDDEELKV-ALGLNSA  
 seq2            YVLYPKGKVSEIQEKQFTTLGR-NITALEVDGTFDDCQALVKAAFMDQELNE-QLLLTSA  
                  :::*:.: * :*   : *:   .:  :   * ** ** :**  * *  :      * :.  
   
 seq3            NSINWARIMAQIVYYFHAALQLG-APH-RSVAFSVPTGNFGDIFAGYLARNMGLPVSQLI  
 seq4            NSINWARIMAQIVYYFYAGLRLG-APH-RAAAYSVPTGNFGDIFAGYLASKMGLPVAQLM  
 seq5            NSINWARVLAQVVYYFYAYFRVA-ALFGQEVVFSVPTGNFGDIFAGYVAKRMGLPIRRLI  
 seq1            NSINISRLLAQICYYFEAVAQLPQETRNQ-LVVSVPSGNFGDLTAGLLAKSLGLPVKRFI  
 seq2            NSINVARFLPQAFYYFYAYAQLKKAGRAENVVICVPSGNFGNITAGLFGKKMGLPVRRFI  
                 **** :*.:.*  *** *  ::      .  . .**:****:: ** ..  :***: :::  
   
 seq3            VATNRNDILHRFMSGNRYDKDTLHPSLSPSMDIMVSSNFERLLFDLHGRNGKAVAELLDA  
 seq4            IATNRNDVLHRLLSTGDYARQTLEHTLSPSMDISVSSNFERLMFDLYERDGAAIASLMAA  
 seq5            LATNENNILSRFINGGDYSLGDVVATVSPSMDIQLASNFERYVYYLFGENPARVREAFAA  
 seq1            AATNVNDTVPRFLHDGQWSPKATQATLSNAMDVSQPNNWPR-VEELFR------------  
 seq2            AANNKNDIFYQYLQTGQYNPRPSVATIANAMDVGDPSNFAR-VLDLYGGS----------  
                  *.* *: . : :  . :       ::: :**:  ..*: * :  *.  
   
 seq3            FKASGKLSVEDQRWTEARKLFDSLAVSDEQTCETIAEVYRSCGELLDPHTAIGVRAAREC  
 seq4            FDD-GDITLSDAAMEKARQLFASHRVDDAQTLACIADVWGRTEYLLDPHSAIGYAAATQP  
 seq5            LPTKGRIDFTEAEMEKVRDEFLSRSVNEDETIATIAAFHRETGYILDPHTAVGVKAALEL  
 seq1            -------------RKIWQLKELGYAAVDDETTQQTMRELKELGYTSEPHAAVAYRALRDQ  
 seq2            -------------HAAIAAEISGTTYTDEQIRESVKACWQQTGYLLDPHGACGYRALEEG  
                                       .    : :                :** * .  *  :  
   
 seq3            RRSLSVPMVTLGTAHPVKFPEAVEKAGIGQAPALPAHLADLFEREERCTVLPNELAKVQA  
 seq4            GANTQTPWVTLATAHPAKFPDAIKASAVGTTAQLPVHLADLFERSEHFDVLPNDIAAVQR  
 seq5            VQDG-TPAVCLATAHPAKFAEAVVR-AVGFEPSRPTSLEGIEALPSRCDVLDADRDAIKA  
 seq1            LNPG-EYGLFLGTAHPAKFKESVEA-ILGETLDLPKELAERADLPLLSHNLPADFAALRK  
 seq2            LQPG-ETGVFLETAHPAKFLQTVES-IIGTEVEIPAKLRAFMKGEKKSLPMTKEFADFKS  
                         : * ****.** :::    :*     *  *            :  :   .:  
   
 seq3            FVSQHGNRGKPL  
 seq4            FMSGHLGA----  
 seq5            FIEKKAL-----  
 seq1            LMMNHQ------  
 seq2            YLLGK-------  
                  :  :  
 </pre>  
   
 <h3> Given a multiple sequence alignment, determine the most likely evolutionary history of the sequences (i.e., construct a phylogenetic tree).</h3>  
   
 Here is one reasonable tree for the last 5 sequences.  Note that we now have alignments that  
 contain thousands of sequences, and even displaying such trees is nontrivial.  
423  <pre>  <pre>
424                       ,--------------------------------------------------- seq1                       ,--------------------------------------------------- seq1
425                       |                       |
# Line 196  Line 444 
444    |    |
445    `---------------------------------------------- seq5    `---------------------------------------------- seq5
446  </pre>  </pre>
447    The tree suggests that at some point an ancestral
448  This is an <i>unrooted tree</i>, since we have no idea just looking at extant  cell replicated. One copy led (through a chain of descendants) to <b>seq5</b>,
449  sequences about where the root should lie.  while the remaining sequences descend from the other copy.
450    <p>Note that we now have alignments that
451  <h2>Some Random Facts that You Should Absorb</h2>  contain thousands of sequences, and even displaying such trees is
452    nontrivial.
453  Most genomes of bacteria contain between 400,000 and 12,000,000 characters.  Because evolution plays such a central role in the phenomena we study,
454  Normally, the genes in a genome  the construction of alignments
455    and trees in order to compare extant versions of proteins and gain
456    insight into their historical origins
457    is considered basic to the task at hand.
458    </p>
459    <h3>Some Random Facts that You Should Absorb</h3>
460    Most genomes of bacteria contain between 400,000 and 12,000,000
461    characters. Normally, the genes in a genome
462  cover about 90% of the genome.  cover about 90% of the genome.
463  Normally, there is about one gene per 1000 characters in a bacterial genome.  Normally, there is about one gene per 1000 characters in a bacterial
464  <p>  genome.
465  So,  <p>So, </p>
466  <ul>  <ul>
467  <li> What is the length of the average protein sequence?  <li> What is the length of the average protein sequence? </li>
468  <li>How many genes do these  <li>How many genes do these
469  genomes have?  genomes have? </li>
470  <li>What is the average length of a gene?  <li>What is the average length of a gene?
471    </li>
472  </ul>  </ul>
473  <br>  <br>
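<p>The three questions above can be worked through with simple arithmetic. The following is a minimal sketch, assuming that three DNA characters encode one amino acid and that the 90% coverage figure applies uniformly; the function name <i>estimate</i> is illustrative and not part of the original notes.</p>

```python
# Back-of-the-envelope estimates from the figures quoted in the text.
CHARACTERS_PER_GENE = 1000   # about one gene per 1000 characters
CODING_FRACTION = 0.90       # genes cover about 90% of the genome
CHARACTERS_PER_CODON = 3     # assumption: three DNA characters per amino acid


def estimate(genome_length):
    """Return (gene count, average gene length, average protein length)."""
    gene_count = genome_length // CHARACTERS_PER_GENE
    avg_gene_length = CODING_FRACTION * genome_length / gene_count
    avg_protein_length = avg_gene_length / CHARACTERS_PER_CODON
    return gene_count, avg_gene_length, avg_protein_length


for length in (400_000, 12_000_000):
    genes, gene_len, protein_len = estimate(length)
    print(f"{length} characters: about {genes} genes, "
          f"{gene_len:.0f} characters per gene, "
          f"{protein_len:.0f} amino acids per protein")
```

<p>Under these assumptions the gene count ranges from roughly 400 to roughly 12,000, while the average gene and protein lengths do not depend on genome size.</p>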
474  It is worth spending just a short bit of time thinking about what types of cellular  It is worth spending just a short bit of time thinking about what types
475  machines must exist.  Here are a few thoughts to start with  of
476    machines must exist in each cell. Here are a few thoughts to start with:
477  <ul>  <ul>
478  <li>  <li>
479  There must be one or more machines that support replication of the cell.  You would  There must be one or more machines that support replication of the
480  need something to copy the genome, and you would need something that could build the DNA  cell. You would
481  bases that represent the characters (i.e., you will need machines to build the molecules  need something to copy the genome, and you would need something that
482  corresponding to each of the four characters in the alphabet of DNA bases.  could build the DNA
483  <li>  bases that represent the characters (i.e., you will need machines to
484  As we mentioned, you have transport machines that take things into and out of the cell.  Many  build the molecules
485  cells can import food in the form of sugar molecules.  For example, many cells can import  corresponding to each of the four characters in the alphabet of DNA
486  <i>glucose</i>, a six-carbon compound.  As the compound gets broken down into smaller compounds,  bases).
487  energy is salvaged from the broken bonds to power the machines in the cell.  The smaller compounds  </li>
488    <li>As we mentioned, you have transport machines that take
489    things into and out of the cell. Many
490    cells can import food in the form of sugar molecules. For example, many
491    cells can import
492    <i>glucose</i>, a six-carbon compound. As the compound
493    gets broken down into smaller compounds,
494    energy is salvaged from the broken bonds to power the machines in the
495    cell. The smaller compounds
496  are used as building blocks for other needs.  are used as building blocks for other needs.
497  <li>  </li>
498  There must be one or more machines involved in building proteins from the descriptions in the genes.  <li>There must be one or more machines involved in building
499  In particular, we will need a machine for each of the amino acids (unless the cell can import some  proteins from the descriptions in the genes.
500    In particular, we will need a machine for each of the amino acids
501    (unless the cell can import some
502  of them).  of them).
503  <li>  </li>
504  There must be mechanisms for sensing what is going on in the environment and allowing the cell  <li>There must be mechanisms for sensing what is going on in
505    the environment and allowing the cell
506  to react to it.  For example, many cells can "swim" towards food.  to react to it.  For example, many cells can "swim" towards food.
507    </li>
508  </ul>  </ul>
509  Those were just a few examples.  For any cell, we have many, many machines, and we still  Those were just a few examples. For any cell, we have many, many
510  do not even understand what some of them do.  machines, and we still
511  <p>  do not even understand what some of them do. Later, we will try to
512  About 50-60% of the genes occur within 5000 characters of another gene such that  offer a more structured
513  the two genes encode proteins that are part of the same cellular machine.  If you  estimate of what is already known.
514  had a genome in which the genes were identified, but the correspondence between the encoded  <p>About 50-60% of the genes occur within 5000 characters of
515  proteins and cellular machines was completely unknown, what could you learn using this fact?  another gene such that
516  Is the situation significantly different if you have 1000 genomes (let us say that  the two genes encode proteins that are part of the same cellular
517  you know where the genes occur, but the correspondence between the proteins and cellular machines  machine. This fact suggests that just having a large number of genomes
518  is completely unknown in each case).  would enable a person to group
519  <p>  the genes into the machines they implement, without the person
520  Occasionally, proteins that are usually distinct in most cells are fused into a single protein in  understanding the functions
521  a few cells.  In these cases, the fused gene is (by definition) part of a single machine, and  of the machines or the roles played by each protein.
522  in most cells in which the proteins are not fused, the two distinct proteins are separate components  </p>
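<p>One way the proximity clue might be exploited across many genomes is sketched below. It assumes that genes have already been grouped by sequence similarity and carry a group label; the labels, the tuple layout, and the function names are hypothetical. Pairs of labels that repeatedly occur within 5000 characters of each other become candidates for membership in the same machine.</p>

```python
from collections import Counter

MAX_DISTANCE = 5000  # "within 5000 characters", as stated in the text


def close_pairs(genes):
    """Return the set of label pairs whose genes lie within MAX_DISTANCE.

    genes: list of (label, start, end) tuples for one contig (assumed layout).
    """
    ordered = sorted(genes, key=lambda g: g[1])
    pairs = set()
    for i, (label_a, _, end_a) in enumerate(ordered):
        for label_b, start_b, _ in ordered[i + 1:]:
            if start_b - end_a > MAX_DISTANCE:
                break  # genes are sorted by start, so later ones are farther
            if label_a != label_b:
                pairs.add(tuple(sorted((label_a, label_b))))
    return pairs


def conserved_pairs(genomes, min_genomes):
    """Count, per label pair, the genomes in which the pair is close."""
    counts = Counter()
    for genes in genomes.values():
        counts.update(close_pairs(genes))
    return {pair: n for pair, n in counts.items() if n >= min_genomes}


toy_genomes = {
    "g1": [("A", 0, 900), ("B", 1000, 1900), ("C", 20000, 20900)],
    "g2": [("B", 50000, 50900), ("A", 51000, 51900), ("C", 100, 1000)],
    "g3": [("A", 0, 900), ("C", 1500, 2400), ("B", 30000, 30900)],
}
print(conserved_pairs(toy_genomes, min_genomes=2))
```

<p>Counting genomes rather than individual gene pairs is a design choice: a pair that is close in only one genome may be a coincidence, while conservation across many genomes is the stronger clue.</p>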
523  of a single machine.  How would you go about locating fused genes, and what could you learn from them?  <p>Occasionally, proteins that are usually distinct in most cells
524  <p>  are fused into a single protein in
525  Biologists have figured out the roles of about 50% of the genes.  That is, they can  a few cells. In these cases, the fused gene is (by definition) part of
526  place the gene in a cellular machine, they know what the machine does, and they know  a single machine, and
527  the specific role of the gene in sustaining the functionality of the machine.  in most cells in which the proteins are not fused, the two distinct
528  <br><br>  proteins are separate components
529    of a single machine. This, too, offers clues to support analysis of
530  <h2>Imposing a Structure on Characterizing the Inventory</h2>  which proteins go with which machines.
531    </p>
532  One central goal of bioinformatics is to support an accurate characterization of the cellular  <p>Biologists have figured out the roles of about 50% of the
533  machinery for each cell.  It is of major importance to biologists that we be able to support  genes. That is, they can
534  comparative analysis of cells.  Perhaps, the most important aspect of understanding cells relates to  place the gene in a cellular machine, they know what the machine does,
535  their origin in an evolutionary process.  Cells have a long evolutionary history dating back billions of  and they know
536  years.  The machines we see in cells today arose in the past, so we expect to see many current cells  the specific role of the gene in sustaining the functionality of the
537  using machinery that resembles what turns up in other cells.  When we compare machines from different  machine.
538  cells they often look remarkably similar.  On the other hand, those that had a common origin in a cell that existed  <br>
539  billions of years in the past may now have versions that are not very similar.  Modifications, optimizations,  <br>
540  and insignificant alterations all combine to explore the space of operational possibilities for  <h3>Imposing a Structure on Characterizing the Inventory</h3>
541  each type of machine.  Hence, we need a framework for studying similarities and differences in the  <p>One central goal of bioinformatics is to support an accurate
542    characterization of the cellular
543    machinery for each cell. It is of major importance to biologists that we
544    be able to support
545    comparative analysis of cells. Perhaps the most important aspect of
546    understanding cells relates to
547    their origin in an evolutionary process. Cells have a long evolutionary
548    history dating back billions of
549    years. The machines we see in cells today arose in the past, so we
550    expect to see many current cells
551    using machinery that resembles what turns up in other cells. When we
552    compare machines from different
553    cells they often look remarkably similar. On the other hand, those that
554    had a common origin in a cell that existed billions of years in the
555    past may now have versions that are not very similar. Modifications,
556    optimizations,
557    and insignificant alterations all combine to explore the space of
558    operational possibilities for
559    each type of machine. Hence, we need a framework for studying
560    similarities and differences in the
561  cellular machines and the proteins that implement them.  cellular machines and the proteins that implement them.
562  <p>  </p>
563    <p>Here is a short formulation of one way to do this:
564  Here is a short formulation of one way to do this:  <br>
565  <br><br>  <br>
566    </p>
567  <ul>  <ul>
568  <li>A <b>subsystem</b> (i.e., an abstract cellular machine) is a set of functional roles.  <li>A <b>subsystem</b> (i.e., an abstract cellular
569  <li>Each protein implements one or more functional roles.  The set of functional roles  machine) is a set of functional roles.
570  implemented by the protein is called the <b>function of the protein</b>.  The function of a  multifunctional  </li>
571  protein that implements {functional-role-1,functional-role-2} is normally written as  <li>Each protein implements one or more functional roles. The
572    set of functional roles
573    implemented by the protein is called the <b>function of the
574    protein</b>. The function of a multifunctional
575    protein that implements {functional-role-1,functional-role-2} is
576    normally written as
577  <i>functional-role-1 / functional-role-2</i>.  <i>functional-role-1 / functional-role-2</i>.
578  <br><br>  <br>
579  <li>A <b>populated subsystem</b> is a subsystem with an attached spreadsheet.  Each column  <br>
580  in the spreadsheet corresponds to a functional role in the subsystem, and each row corresponds to  </li>
581  a specific genome.  Each cell in the spreadsheet contains the genes from the corresponding genome  <li>A <b>populated subsystem</b> is a subsystem
582  that implement the designated functional role (there may be 0 or more such genes).  with an attached spreadsheet. Each column
583  </ul>  in the spreadsheet corresponds to a functional role in the subsystem,
584  <br><br>  and each row corresponds to
585  We do not actually know what machines are present in a cell.  We are in the midst of a grand  a specific genome. Each cell in the spreadsheet contains the genes from
586  effort to clarify which are there and what they do.  The formulation of subsystems as abstract machines  the corresponding genome
587  in which each row of the subsystem describes a specific cellular machine that is believed to be present,  that implement the designated functional role (there may be 0 or more
588    such genes).
589    </li>
590    </ul>
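<p>The three definitions above map naturally onto simple data structures. The following is a minimal sketch, not the representation used by any particular system; the role names, genome names, and gene identifiers are invented for illustration.</p>

```python
# A subsystem is a set of functional roles (hypothetical role names).
subsystem_roles = ["functional-role-1", "functional-role-2", "functional-role-3"]

# A populated subsystem attaches a spreadsheet: one row per genome,
# one column per functional role, and each cell holds 0 or more genes.
spreadsheet = {
    "genome-A": {
        "functional-role-1": ["geneA.17"],
        "functional-role-2": ["geneA.18", "geneA.402"],
        "functional-role-3": [],
    },
    "genome-B": {
        "functional-role-1": ["geneB.5"],
        "functional-role-2": ["geneB.5"],  # one gene implementing two roles
        "functional-role-3": ["geneB.9"],
    },
}


def protein_function(roles):
    """Write the function of a (possibly multifunctional) protein."""
    return " / ".join(roles)


def missing_roles(sheet, roles, genome):
    """Roles of the subsystem with no gene assigned in the given genome."""
    return [role for role in roles if not sheet[genome].get(role)]


print(protein_function(["functional-role-1", "functional-role-2"]))
print(missing_roles(spreadsheet, subsystem_roles, "genome-A"))
```

<p>An empty cell is kept as an empty list rather than omitted, so that "no gene found" remains visible as an explicit assertion in the row.</p>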
591    <br>
592    <br>
593    We do not actually know what machines are present in a cell. We are in
594    the midst of a grand
595    effort to clarify which are there and what they do. The formulation of
596    subsystems as abstract machines
597    in which each row of the subsystem describes a specific cellular
598    machine that is believed to be present,
599  represents a way to maintain a collection of estimates or assertions.  represents a way to maintain a collection of estimates or assertions.
600  <p>  <p>A <b>protein family</b> is defined to be a set of
601  A <b>protein family</b> is defined to be a set of proteins that implement the same functional roles and  proteins that implement the same functional roles and
602  are similar over the entire lengths of the proteins.  are similar over the entire lengths of the proteins.
603  <p>  </p>
604  We seek a situation in which each protein occurs in one or more subsystems and in a single protein family.  <p>We seek a situation in which each protein occurs in one or
605    more subsystems and in a single protein family.
606  The computational tasks imposed by such a goal are obvious:  The computational tasks imposed by such a goal are obvious:
607    </p>
608  <ul>  <ul>
609  <li>We need to construct databases that implement at least the following entities:  <li>We need to construct databases that implement at least the
610    following entities:
611  <ol>  <ol>
612  <li>cells (i.e., each cell must have an ID and a set of attributes),  <li>cells (i.e., each cell must have an ID and a set of
613    attributes),
614    </li>
615  <li>genomes,  <li>genomes,
616    </li>
617  <li>genes,  <li>genes,
618    </li>
619  <li>proteins,  <li>proteins,
620    </li>
621  <li>functional roles,  <li>functional roles,
622    </li>
623  <li>subsystems, and  <li>subsystems, and
624    </li>
625  <li>protein families.  <li>protein families.
626    </li>
627  </ol>  </ol>
628  <li> We need to add support for developing clues to function by integrating data  </li>
629    <li> We need to add support for developing clues to function by
630    integrating data
631  from sources like proximity within the genome, fusions, etc.  from sources like proximity within the genome, fusions, etc.
632  <li>We need to support a framework for the development of populated subsystems.  </li>
633  <li>We need to construct decision procedures for membership in protein families.  Some  <li>We need to support a framework for the development of
634  of these procedures will be quite complex, although the majority of cases can be  populated subsystems.
635    </li>
636    <li>We need to construct decision procedures for membership in
637    protein families. Some of these procedures will be quite complex,
638    although the majority of cases can be
639  handled by fairly general procedures.  handled by fairly general procedures.
640    </li>
641  </ul>  </ul>
642    <h3>States of the Cell</h3>
643    The notion of <i>subsystem</i> was introduced as an <i>abstract
644    machine</i> -- that is, as an
645    attempt to create a framework for understanding variations within
646    specific cellular machines via
647    a form of comparative analysis. In any specific cell, sets of specific
648    cellular machines are switched on and off as units. That is, they are <i>co-regulated</i>.
649    We will call such a set
650    of <i>co-regulated cellular machines</i> a <b>regulon</b>
651    (note that a regulon is often a set containing
652    a single cellular machine). A <b>state</b> of a cell will
653    be defined
654    as the set of regulons that are operational at a point in time. Thus, a
655    state amounts to the set
656    of cellular machines that are operational at one instant.
657    <p>If we think of a car as a bag of machines that interact to
658    make it function, we might consider there
659    to be a huge number of states. There are many very minor "machines"
660    like the arm rest (or the radio, or the night light) that can be on or
661    off. However, we can divide the states of a car into major groupings
662    based on the status
663    of some key "machines". For example, "off" (the state in which the
664    engine is turned off and the car is parked) and
665    "on" (the engine is running and the car is moving) might be viewed as a
666    crude partitioning of the states into
667    two "major states".
668    </p>
669    <p>Similarly, I believe that we should think about <i>major
670    states of the cell</i> as being determined by the functioning (or
671    not) of a limited set of regulons. The determination of these regulons,
672    the major states,
673    and how transitions between them are managed are all now parts of the
674    picture being filled in.
675    </p>
676    <h3>Microarrays</h3>
677    A microarray is, for a given genome, two lists of genes that "changed
678    expression levels" between two states of a
679    cell. Basically, the first list contains genes that were "active" during
680    the first state, but not the second; and the
681    second list contains genes that were "active" in the second but not the
682    first. If a cellular
683    machine utilizes protein <i>X</i>, and <i>X</i>
684    is in the first list, and if <i>X</i> is used in
685    only one cellular machine, then it would be reasonable to infer that
686    the machine was
687    active in the first state, but not the second. If one knew the regulons
688    for a specific cell, it would go
689    a long way toward supporting the extraction of insights from these microarrays. On
690    the other hand, if one had many,
691    many microarrays, and if the specific cellular machines for the cell
692    are known, then one could make
693    substantial progress in uncovering the exact composition of the
694    regulons that make up the cell.<br>
695    <br>
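<p>The inference described above can be sketched in a few lines. This is an illustration only: it assumes the composition of each cellular machine is already known, and the machine and gene names are invented. A gene that belongs to more than one machine is ignored, since it does not identify a machine unambiguously.</p>

```python
def machines_implied(machines, gene_list):
    """Machines implied by listed genes that belong to exactly one machine.

    machines: dict mapping a machine name to the set of genes it uses.
    gene_list: one of the two lists produced by a microarray.
    """
    membership = {}
    for name, genes in machines.items():
        for gene in genes:
            membership.setdefault(gene, set()).add(name)
    implied = set()
    for gene in gene_list:
        owners = membership.get(gene, set())
        if len(owners) == 1:
            implied.update(owners)
    return implied


machines = {"M1": {"X", "Y"}, "M2": {"Y", "Z"}}
first_list = ["X", "Y"]   # active in the first state, not the second
second_list = ["Z"]       # active in the second state, not the first

print(machines_implied(machines, first_list))
print(machines_implied(machines, second_list))
```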
696    We are just now reaching the point where we do, in fact, have hundreds
697    of microarrays (each representing changes between two sampled states of
698    the cell). &nbsp;<br>
699    Let us reflect on how one might use this data to uncover the regulons
700    that are represented and how they relate to the major "states of the
701    cell".<br>
702    <br>
703    We might begin by trying to determine sets of genes from each subsystem
704    that appear to "move together". &nbsp; More precisely, we want to arrive
705    at sets of genes that perform a well-defined function and of which some subset
706    almost always shows up in the microarrays as "moving together".
707    &nbsp;Of these, if we have genes that occur only in a single
708    subsystem, then it would be reasonable to think of them as <span style="font-style: italic;">signatures</span> for the set
709    of genes. &nbsp;The most natural way to do this would be to start
710    with metabolic subsystems or, even better, <span style="font-style: italic;">scenarios</span> (discussed
711    below), which are subsets of functional roles from a metabolic subsystem
712    such that the subset is a connected set with well-defined inputs and
713    outputs. &nbsp;We then wish to define discovery of the regulon sets
714    associated with each condition as follows:<br>
715    <br>
716    <ol>
717    <li>&nbsp;First, for each scenario define&nbsp;</li>
718    <ul>
719    <li>the set of genes that are expected to show up in a
720    microarray when the scenario is activated or deactivated (call this
721    "the set of genes that move together" = <span style="font-style: italic;">SGMT for the scenario),</span></li>
722    <br>
723    <li>the subset of genes (perhaps empty) of the SGMT that are <span style="font-style: italic;">signatures</span> (call
724    this <span style="font-style: italic;">signatures of the
725    scenario)</span></li>
726    </ul>
727    <br>
728    <li>Then define the <span style="font-style: italic;">set
729    of regulons</span>. &nbsp;Each regulon is &nbsp;a set of
730    scenarios. &nbsp;There is a cost <span style="font-weight: bold;">cost_reg</span> associated
731    with the definition of each regulon. &nbsp;This prevents the
732    definition of numerous regulons all containing just one scenario.
733    &nbsp;If the penalty is set too high, only one regulon will be
734    defined. &nbsp;If it is set too low, then a large set of small
735    regulons results.</li>
736    <br>
737    <li>Finally, you need to define the set of regulons that were
738    activated for each microarray and the set that were deactivated.</li>
739    <br>
740    <li>Now, you compute a score for your decisions as&nbsp;<span style="font-weight: bold;">score = P - M - (cost_reg *
741    number_of_defined_regulons * number_of_microarrays)</span> where</li>
742    <br>
743    <ul>
744    <li><span style="font-weight: bold;">P</span>
745    = <span style="font-weight: bold;">p1 + p2,</span>
746    where&nbsp;<span style="font-weight: bold;"></span></li>
747    <br>
748    <ul>
749    <li><span style="font-weight: bold;">p1</span>
750    = <span style="font-weight: bold;">a1 * value_signature </span>and
751    <span style="font-weight: bold;">a1</span>
752    is the number of signatures of scenarios that moved as predicted, and <span style="font-weight: bold;">value_signature </span>is
753    the value associated with a signature moving in the direction predicted,</li>
754    <br>
755    <li><span style="font-weight: bold;">p2 = a2 *
756    value_SGMT_nonsig</span> and <span style="font-weight: bold;">a2</span>
757    is the number of SGMT genes that moved as predicted, and <span style="font-weight: bold;">value_SGMT_nonsig</span> is
758    the value associated with a non-signature SGMT gene moving in the
759    direction predicted, and</li>
760    <br>
761    </ul>
762    <li><span style="font-weight: bold;">M = m1 +
763    m2, where</span></li>
764    <br>
765    <ul>
766    <li><span style="font-weight: bold;">m1 = b1 *
767    value_signature</span> and <span style="font-weight: bold;">b1</span>
768    is the number of signatures of scenarios that did not move as
769    predicted, &nbsp;and</li>
770    <br>
771    <li><span style="font-weight: bold;">m2 = b2 *
772    value_SGMT_nonsig </span>and <span style="font-weight: bold;">b2</span>
773    is the number of SGMT genes that did not move as predicted.&nbsp;</li>
774    </ul>
775    <br>
776    The&nbsp;<span style="font-weight: bold;">score </span>reflects
777    how well your decisions in the first three steps match the data in the
778    microarrays. &nbsp;The object is to make the sets of decisions in
779    the first three steps in a way that maximizes the&nbsp;<span style="font-weight: bold;">score.<br>
780    <br>
781    </span><span style="font-weight: bold;"></span><br>
782    <span style="font-weight: bold;"></span><span style="font-weight: bold;"></span>
783    </ul>
784    </ol>
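<p>The scoring rule in the last step can be written out directly. The sketch below only evaluates the score for a given set of counts; it does not search for the decisions that maximize it. It assumes that <b>a2</b> and <b>b2</b> count non-signature SGMT genes only, so that signatures are not counted twice, and the numeric values in the example are arbitrary.</p>

```python
def score(a1, a2, b1, b2, value_signature, value_sgmt_nonsig,
          cost_reg, number_of_defined_regulons, number_of_microarrays):
    """score = P - M - (cost_reg * number_of_defined_regulons * number_of_microarrays)."""
    p1 = a1 * value_signature      # signatures that moved as predicted
    p2 = a2 * value_sgmt_nonsig    # other SGMT genes that moved as predicted
    m1 = b1 * value_signature      # signatures that did not move as predicted
    m2 = b2 * value_sgmt_nonsig    # other SGMT genes that did not move as predicted
    p = p1 + p2
    m = m1 + m2
    penalty = cost_reg * number_of_defined_regulons * number_of_microarrays
    return p - m - penalty


# Arbitrary example values, chosen only to exercise the formula.
example = score(a1=3, a2=10, b1=1, b2=4,
                value_signature=5, value_sgmt_nonsig=1,
                cost_reg=0.5, number_of_defined_regulons=4,
                number_of_microarrays=2)
print(example)
```

<p>Raising <b>cost_reg</b> makes every additional regulon more expensive, which is how the rule discourages a large set of single-scenario regulons.</p>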
785    
 <h2>Microarrays and States of the Cell</h2>  
786    
787  We will think of a <b>regulon</b> as a set of subsystems.  A <b>state of the cell</b> is  <h2>Notes for the Enhanced Abstraction</h2>
788  defined as the set of regulons that are operational at a point in time.  The process of <b>expressing a gene</b> amounts to using
789  <p>  the gene to produce the functional component of
790  A <b>consistent microarray</b> (for our purposes) is  a machine (a protein for a protein-encoding gene, and an RNA for an
791    RNA-encoding gene).
792    The process of expressing a protein-encoding gene, that is, taking a gene (a
793    string of DNA formed by concatenating a sequence of
794    regions from contigs) and producing a protein, is normally thought of as
795    taking place in two steps.
796    <b>Transcription</b> is the process of a specific machine
797    moving along the contig and making a copy of the
798    gene as RNA. This string of RNA is then <b>translated</b>
799    by a separate machine. The machine that performs
800    the copying of the gene into a string of RNA is called an <b>RNA
801    polymerase</b>. The machine to translate
802    the RNA into a protein, the <b>ribosome</b>, is made up of
803    both proteins and RNA components.
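<p>The two steps can be illustrated on strings. This is a toy sketch under stated assumptions: the gene is given as the DNA string whose RNA copy differs only in replacing T with U, and the codon table below is a small, deliberately incomplete subset of the standard genetic code, so most real genes would raise an error.</p>

```python
# Incomplete subset of the standard genetic code (RNA codon -> amino acid).
# None marks a stop codon.
CODON_TABLE = {
    "AUG": "M", "UUU": "F", "AAA": "K", "GGG": "G", "UGG": "W",
    "UAA": None, "UAG": None, "UGA": None,
}


def transcribe(gene_dna):
    """Step 1: copy the gene as a string of RNA (T becomes U)."""
    return gene_dna.upper().replace("T", "U")


def translate(rna):
    """Step 2: read the RNA three characters at a time and build the protein."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        codon = rna[i:i + 3]
        if codon not in CODON_TABLE:
            raise KeyError(f"codon {codon} is not in this toy table")
        amino_acid = CODON_TABLE[codon]
        if amino_acid is None:  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate(transcribe("ATGTTTAAAGGGTGGTAA")))
```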
804    <p>Machines can be made up of both protein and RNA components,
805    although most machines are built from
806    just proteins. Some of the most fundamental questions in biology relate
807    to how life started and the steps
808    required to gradually enrich the basic machinery to the point where
809    this magnificent information storage and
810    maintenance system based on DNA, RNA and proteins could have arisen.
811    There is much that can be inferred by
812    reasoning back from what we now observe and reasoning forward from the
813    relatively little we know of what the early earth was like. One
814    possible set of goals would be to first understand in detail the
815    inventory
816    of components we now see in life forms, composing something analogous
817    to a CAD/CAM system describing life forms.
818    Then, as a second step, to understand the sequence of transformations
819    that led from some initial raw components
820    to initial life forms to those we have seen and characterized.
821    </p>
822    <p>The need to allow occasional "nonstandard" characters in
823    protein sequences and a loosening of the correspondence
824    between a gene and characters in the protein sequence it can be used to
825    build results from the fact that
826    evolution has produced the existing genetic codes and they continue to
827    evolve (either converging or diverging
828    depending on the outcome of basically random processes operating under
829    selective pressure).
830    <br>
831    </p>
832    <h2>Notes on the Abstraction Extended to Support Regulation</h2>
833    There are two basically different regulatory mechanisms in the cell. In
834    one, you have a metabolic
835    network in which fluxes are tightly controlled by positive and negative
836    feedback loops. This <b>metabolic
837    regulation</b> occurs very rapidly. <b>Transcriptional
838    regulation</b> occurs orders of magnitude more slowly. It is just
839    this transcriptional regulation that we consider in this extension.
840    <p>As the cell changes state, regulons are activated or
841    de-activated by
842    transcriptional regulators (either protein or RNA) binding to specific
843    sites in the DNA. This model has the redeeming characteristic of
844    simplicity. It is certainly the case that there are innumerable
845    important issues that it disregards (e.g., regulation based on DNA
846    packaging, due to small RNAs binding the RNAs produced by
847    transcription, etc.). In forming any clear notion of transcriptional
848    regulation and how it is achieved, we will need to carefully separate
849    these different mechanisms, since they have fundamentally different
850    modes of control and operation. We are arguing that the notion of a
851    protein or RNA being used to flip regulons on and off by binding to
852    control sites within the genome is a major form of regulation and
853    probably the right place to start any effort to formulate a useful
854    abstraction.
855    </p>
856    <h1>The Role of Bioinformatics in Supporting the Genomic
857    Revolution</h1>
858    Within the growing genomics revolution, one can easily separate
859    developments and
860    goals relating to advances in medicine and agriculture from
861    those relating to
862    pure science. Here we consider only issues relating to pushing advances
863    in basic research.
864    Here is an overview of our perspective:
865  <ol>  <ol>
866  <li> The ID of an experiment.  The experiment corresponds to two states of the cell, S1 and S2.  <li> The different life forms that now exist were produced by
867  <li> A list of proteins that are in the regulons of S1, but not in those of S2.  an evolutionary process,
868  <li> A list of proteins that are in the regulons of S2, but not in those of S1.  which leads to our view that comparative analysis is the key to
869    understanding. Biological
870    machines that exist in complex forms will often also still exist in
871    simpler forms (usually
872    in simpler organisms).
873    </li>
874    <li> Unravelling exactly how a machine works is more easily
875    done in simpler organisms. They
876    are easier to work with, and it is easier to gather the data needed to
877    support comparative analysis.
878    </li>
879    <li> This leads to the view that we should try to understand
880    single-celled organisms to lay
881    the foundation for analysis of multicellular organisms.
882    </li>
883    <li> The characterization of unicellular life will require
884    access to orders of magnitude
885    more data than exist now (we have more-or-less complete genomes for
886    about 1000 organisms, but
887    that represents a small fraction of a percent of extant single-celled
888    life forms).
889    </li>
890    <li> The immediate basic steps that are taking place are
891    roughly:
892    <br>
893    <br>
894    <ol>
895    <li> Attempt to formulate a growing list of abstract
896    machines that correspond
897    to the many specific machines that implement the same goal. These
898    abstract machines (subsystems)
899    represent the basic units that make up life forms.
900    </li>
901    <li> Create protein and RNA families in which the members
902    are all homologous (share a common ancestor),
903    remain similar over almost all of the sequence, and all implement a
904    common function.
905    </li>
906    <li> Build alignments for each protein family, along with
907    phylogenetic trees that represent
908    an estimate of the history of how these specific sequences evolved.
909    </li>
910    <li>Provide a computational framework to support continued
911    maintenance and development of these
912    basic data types.</li>
913  </ol>  </ol>
914  <br>  <br>
915  A <b>real microarray</b> is just two sets of proteins.  We have some notion (e.g., an ID) for  Groups are now actively pursuing all of these goals. &nbsp;For
916  each of two states of the cell, but no idea what regulons make up these states.  There is a  individuals wishing to build a research program, we suggest
917  substantial error rate in the two lists of proteins (e.g., some of the proteins in the first list either were not in  collaborating with an existing group or moving to one of the newer
918  S1 or they were in S2).  areas that are now emerging.
919  <p>  </li>
920  The interesting research question is, given a large list of real microarrays, can you  <br>
921  attach sets of regulons to all of the state IDs, and then give a minimal set of changes to the data  <li> A limited number of groups have progressed to the point
922  in the real microarrays needed to convert them to consistent microarrays.  where they can create models of an organism that display predictive
923    capabilities. There are many forms of modeling. In our view
924    it is important that we reach the state where we can routinely model
925    states of the cell, transitions
926    between states, and metabolic characteristics of the cell. We believe
927    that it is now possible
928    to create fairly comprehensive representations of the metabolic
929    networks of some bacteria. In these cases, we have substantial amounts
930    of physiological data, the number of abstract machines
931    in the cell is fairly limited, and it is possible to compare the
932    predictions against observed results. &nbsp; An effort has begun by
933    a
934    team within the SEED project, led by researchers from Hope College, to
935    develop a library of what they call&nbsp;<span style="font-style: italic;">scenarios</span>.
936    &nbsp;These scenarios capture the idea of a specific machine
937    implementing a metabolic transformation operating with well-defined
938    inputs and outputs. From a large and growing number of scenarios in
939    this library, they automatically reconstruct metabolic networks for
940    most of the bacteria for which genomes have been sequenced.
941    &nbsp;This
942    effort is setting the stage for widespread whole genome metabolic
943    modeling.&nbsp;</li>
944    <li>Rapid progress has been made in our ability to
945    recognize regulatory binding sites and to use them with knowledge of
946    specific machines to create a consistent picture of regulons in some
947    bacteria. &nbsp;This technology has been gathering adherents over
948    the
949    last five years and we believe that it will play a significant role in
950    clarifying regulons, identifying additional proteins that belong to specific
951    machines, and improving our understanding of states of the cell.&nbsp;</li>
952    </ol>
953    Having said all that, is it possible to list some of the
954    important, high-payout bioinformatic questions that are worth
955    pondering? &nbsp;Here is a list for your consideration:<br>
956    <br>
957    <ol>
958    <li>The
959    definition of the location of genes&nbsp;for bacterial genomes
960    needs cleaning up. &nbsp;The situation is made somewhat more
961    interesting by a growing use of sequencing technologies that produce
962    systematic errors leading to numerous frameshifts and poorly called
963    start locations. &nbsp;Fixing these would be a problem of modest
964    difficulty and very modest reward. &nbsp;The situation in
965    eukaryotic
966    genomes is quite different. &nbsp;The problem of defining the genes
967    in
968    a eukaryotic genome is still largely unsolved. &nbsp;We conjecture
969    that</li>
970    <ul>
971    <li>the
972    key to progress is the use of sets of genomes (i.e., solve the problem
973    of defining the genes in a set of closely-related genomes first), and</li>
974    <li>the best place to begin
975    is with the single-celled eukaryotic genomes. &nbsp;There are
976    many
977    types of single-celled eukaryotes, and some of them will undoubtedly
978    offer major challenges. &nbsp;However, existing experience suggests
979    that there will be numerous <span style="font-style: italic;">fungal</span>
980    genomes available (for example) and that focusing on these would be a
981    much easier task than trying to face plants, animals, etc.</li>
982    </ul>
983    <li>The
984    creation of populated subsystems is essentially a task for expert
985    biologists. &nbsp;However, the tools to support the task are a
986    reasonable focus for bioinformatic projects. &nbsp;The tools needed
987    to
988    delicately separate the roles of paralogous proteins have been
989    illustrated in the works of Jensen and Bonner, among others.
990    &nbsp;These tools relate to use of alignments, trees and motifs to
991    define the decision procedures needed to classify proteins into one of
992    several closely-related families.</li>
993    <li>The development of a
994    self-consistent set of protein families is a task closely related to
995    the one above. &nbsp;Several major efforts are currently
996    building such protein families. &nbsp;The
997    development
998    of protocols for maintenance of the families, studying the evolutionary
999    history of related families, development of motifs that characterize
1000    specific families, and so forth all represent parts of a large
1001    classification problem.</li>
1002    <li>There is a class of tools that attempt to spot <span style="font-style: italic;">functional coupling</span>
1003    between specific proteins. &nbsp;Some are bioinformatic (like the
1004    chromosomal clustering and fusion phenomena briefly discussed above).
1005    &nbsp;Some are essentially experimental data (e.g., protein-protein
1006    interaction data or microarray data). &nbsp;The integration of
1007    evidence
1008    into a system capable of predicting whether or not two
1009    specific
1010    proteins are both components of a single machine has been attempted,
1011    but much more remains to be done. &nbsp;The closely-related problem
1012    of
1013    determining whether or not two protein families are <span style="font-style: italic;">functionally coupled</span>
1014    (and precisely what that means) should be considered simultaneously.</li>
1015    <li>Defining
1016    regulons by gradually composing a consistent interpretation of
1017    subsystems, regulatory sites, and physiological data is a task that is
1018    semi-automated. &nbsp;Development of a fully automated version seems
1019    too
1020    ambitious, but developing tools to increase the productivity of
1021    biologists developing these models of transcriptional regulation is
1022    certainly going to gain much more attention.</li>
1023    <li>Development of a meaningful notion of <span style="font-style: italic;">states of a cell</span>
1024    is a problem that seems to us to have many of the characteristics one wants:
1025    &nbsp;it is a problem for which relevant data is starting to
1026    appear,
1027    many aspects of the needed infrastructure have only recently appeared,
1028    and the outcome may be of fundamental significance.</li>
1029    <li>To what
1030    extent is it possible to predict the protein families which have
1031    instances in a given cell given the closest 10 neighboring genomes and
1032    detailed information on the families they contain?</li>
1033    <li>Is it possible to think of a set of protein families as <span style="font-style: italic;">major predictors</span>
1034    that would allow you to infer the presence or absence of many other
1035    families?</li>
1036    </ol>
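The question of predicting which protein families occur in a cell from its closest neighboring genomes can be made concrete with a very simple baseline. The sketch below is not a method from this document: it assumes each neighboring genome is represented as a set of family names and predicts a family as present when a chosen fraction of neighbors contain it. The function name <code>predict_families</code>, the parameter <code>min_fraction</code>, and the family names are all hypothetical.

```python
from collections import Counter


def predict_families(neighbor_families, min_fraction=0.8):
    """Predict families present in a target genome by neighbor voting.

    neighbor_families: list of sets, one set of family names per
        neighboring genome (hypothetical representation).
    min_fraction: fraction of neighbors that must contain a family
        before it is predicted as present (an assumed, tunable cutoff).
    """
    if not neighbor_families:
        return set()
    # Count, for each family, how many neighboring genomes contain it.
    counts = Counter(fam for fams in neighbor_families for fam in set(fams))
    threshold = min_fraction * len(neighbor_families)
    return {fam for fam, n in counts.items() if n >= threshold}


# Toy data: four neighboring genomes with made-up family names.
neighbors = [
    {"FamA", "FamB", "FamC"},
    {"FamA", "FamB"},
    {"FamA", "FamC"},
    {"FamA", "FamB", "FamD"},
]
predicted = predict_families(neighbors, min_fraction=0.75)
```

A baseline of this kind ignores phylogenetic distance and the dependence between families. Weighting neighbors by similarity, or conditioning on a small set of predictor families, would be natural refinements to compare against it.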
1037    <br>
1038    <ul>
1039    </ul>
1040    <h1> The Role of Abstraction in Setting the Stage for Software
1041    Development and Modeling</h1>
1042    In
1043    this section, we argue that the abstraction is much more than just a
1044    pedagogical aid. &nbsp;It will form the conceptual underpinnings
1045    of
1046    the software needed to support work on the problems described in the
1047    last section (as well as numerous others that will become apparent as
1048    the revolution progresses).<br>
1049    </body></html>
