
Diff of /FigTutorial/tut_abs.html


revision 1.1, Mon Oct 1 20:12:41 2007 UTC revision 1.4, Thu Nov 15 21:06:50 2007 UTC
# Line 4  Line 4 
4  <h1>An Abstract View</h1>  <h1>An Abstract View</h1>
5  <h2>by Ross Overbeek</h2>  <h2>by Ross Overbeek</h2>
6  </div>  </div>
7    <h2>Introduction</h2>
8    This strange document began as a tutorial for computer scientists and mathematicians.  It was supposed
9    to somehow introduce them to the computational issues in genome analysis.
10    It was requested by an instructor in a computer class.  Overbeek in attempting to respond to this request
11    formulated an abstraction that he began to believe had significance beyond the tutorial.
12    <p>
13    This document is a set of working notes relating to the abstract.  It is not organized properly as
14    an abstraction, a tutorial, or an essay on the role of bioinformatics in support of biological research.  It is,
15    however, organized properly as a working document that relates to all of these goals.
16    <p>
17    It begins with a development of the abstraction.  This will be suitable for mathematicians or computer scientists.
18    The abstraction is developed in three steps: the basic abstraction, the enhanced abstraction needed to provide
19    basic bioinformatics support for biologists, and finally a third step that includes support for the notion
20    of regulation.  The intent throughout this discussion will be to seek a minimal set of concepts needed to
21    effectively capture the essence of the required data.  Unlike almost all efforts to lay a foundation
22    for tutorials, software or research in biology, this effort focuses on leaving out as much as possible.
23    While we do believe that there is an almost unlimited complexity that can be introduced, and almost all of
24    it is needed for some specific goals, the vast majority of tools and discussions require (we believe) relatively few
25    concepts.  As they say, "the proof is in the pudding."
26    
27    <p>
28    The second section will feature a bit more tutorial comments.  It may well repeat much of what is in Part 1.
29    This part is offered as a way of easing a computer scientist or mathematician into the issues that need to be
30    considered, if they wish to try to do useful research relating to the genomics revolution.  Eventually, this part
31    will be dramatically expanded by giving condensed summaries of the machines of the cell broken into two broad
32    sets: the metabolic network and the cellular machinery not directly included in the metabolic network.  Loosely,
33    this separates what would be learned in a microbial biochemistry class (when they exist) from what would
34    be learned in a course on molecular biology.
35    <p>
36    The third part is an essay that attempts to characterize our view on
37    <ul>
38    <li> what the main goals should be in current efforts to advance biological knowledge via genome research,
39    <li> what role bioinformatics researchers have played in the past, and
40    <li> what role they could productively play during the coming few years.
41    </ul>
42    As such, it is undoubtedly an arrogant formulation by a group of individuals with minimal background in
43    biology.
44    <p>
45    The fourth section will focus on the implications of the abstractions for software development.
46    This is a bit of a radical proposal that makes sense to us (and is in an area that we can
47    legitimately claim expertise).
48    
49  <h2>What Is a Cell?</h2>  <h1>Part 1: The Abstractions</h1>
50    <h2>The cell: a Minimal Perspective</h2>
51    
52  A <b>cell</b> is a bag (i.e., a volume enclosed by a membrane) that contains three types of things: compounds, cellular machines, and a genome.  A <b>cell</b> is a bag (i.e., a volume enclosed by a membrane) that contains three types of things: compounds, cellular machines, and a genome.
53  <p>  <p>
54  By the term <b>compound</b> I refer to the normal notion of chemical compound.  By the term <b>compound</b> we refer to the normal notion of chemical compound.
55  <p>  <p>
56    
57  A <b>cellular machine</b> is a set of proteins that together perform a function.   This function is often t  A <b>cellular machine</b> is a set of proteins that together perform a function. Unless otherwise noted,
58  transform a set of compounds into another set.  Some types of machines (transport machines)  when we use the term <i>machine</i> we will always be speaking of a cellular machine.
59    Many machines
60    transform one set of compounds into another set.  Some machines (transport machines)
61  are used to move compounds into  are used to move compounds into
62  or out of the cell.  or out of the cell.  Later we will try to convey a more comprehensive notion of what functions are implemented
63    by machines that we understand.
64  <p>  <p>
65    
66  A <b>protein</b> is a string of amino acids (i.e., a string in the  20-character alphabet {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}).  A <b>protein</b> is a string of amino acids (i.e., a string in the  20-character alphabet {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}).
# Line 56  Line 101 
101  </table>  </table>
102  <br><br>  <br><br>
103  <hr>  <hr>
104  This minimal notion of a cell is enough to explain some of the central  The process of building a protein as a string of amino acids from the gene containing codons is
105    called <b>expressing</b> the gene.
106    <br>
107    A <b>subsystem</b> (i.e., an abstract cellular machine) is a set of functional roles.
108    Each protein implements one or more functional roles.  The set of functional roles
109    implemented by the protein is called the <b>function of the protein</b>.  The function of a  multifunctional
110    protein that implements {functional-role-1,functional-role-2} is normally written as
111    <i>functional-role-1 / functional-role-2</i>.
112    <br><br>
113    A <b>populated subsystem</b> is a subsystem with an attached spreadsheet.  Each column
114    in the spreadsheet corresponds to a functional role in the subsystem, and each row corresponds to
115    a specific genome.  Each cell in the spreadsheet contains the genes from the corresponding genome
116    that implement the designated functional role (there may be 0 or more such genes).
117    <br><br>
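The populated subsystem described above can be sketched as a small data structure. This is only an illustration under our own assumptions: the class name <code>PopulatedSubsystem</code>, the helper <code>protein_function</code>, and all gene, genome, and role identifiers are hypothetical, and sorting the roles before joining them is a convenience that the text does not prescribe.

```python
from dataclasses import dataclass, field


@dataclass
class PopulatedSubsystem:
    """A subsystem (a set of functional roles) with its attached spreadsheet."""
    name: str
    roles: list[str]
    # rows[genome][role] -> genes from that genome implementing that role
    rows: dict[str, dict[str, list[str]]] = field(default_factory=dict)

    def add_gene(self, genome: str, role: str, gene: str) -> None:
        if role not in self.roles:
            raise ValueError(f"{role!r} is not a functional role of {self.name!r}")
        self.rows.setdefault(genome, {}).setdefault(role, []).append(gene)

    def cell(self, genome: str, role: str) -> list[str]:
        """Return the spreadsheet cell; it may hold zero or more genes."""
        return self.rows.get(genome, {}).get(role, [])


def protein_function(roles: set[str]) -> str:
    """Write the function of a (possibly multifunctional) protein as 'role-1 / role-2'."""
    return " / ".join(sorted(roles))
```

An empty cell (a genome with no gene for a role) is simply an empty list, which matches the "0 or more such genes" wording.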
118    We do not actually know what machines are present in a cell.  We are in the midst of a grand
119    effort to clarify which are there and what they do.  The formulation of subsystems as abstract machines
120    in which each row of the subsystem describes a specific cellular machine that is believed to be present,
121    represents a way to maintain a collection of estimates or assertions.
122    <p>
123    A <b>protein family</b> is defined to be a set of proteins that implement the same functional roles and
124    are similar over the entire lengths of the proteins.
125    <p>
126    We seek a situation in which each protein occurs in one or more subsystems and in a single protein family.
127    <p>
128    In any specific cell, sets of specific cellular machines are
129    switched on and off as units.  That is, they are <i>co-regulated</i>.  We will call such a set
130    of <i>co-regulated cellular machines</i> a <b>regulon</b> (note that a regulon is often a set containing
131    a single cellular machine).  A <b>state</b> of a cell will be defined
132    as the set of regulons that are operational at a point in time.  Thus, a state amounts to the set
133    of cellular machines that are operational at one instant.
134    <p>
135    Microarrays are, for a given genome, two lists of genes that "changed expression levels" between two states of a
136    cell.  Basically, the first list contains genes that were "active" during the first state, but not the second; and the
137    second list contains genes that were "active" in the second but not the first.  If a cellular
138    machine utilizes protein <i>X</i>, and <i>X</i> is in the first list, and if <i>X</i> is used in
139    only one cellular machine, then it would be reasonable to infer that the machine was
140    active in the first state, but not the second.
141    
142    <h2>The cell: the Enhanced Formulation Needed to Support Bioinformatics</h2>
143    
144    In the enhanced abstraction, we need to loosen up some concepts.  In particular,
145    <ul>
146    <li> A <b>genome</b> is a set of strings in a 4-character alphabet.  Each of the strings
147    is called a <b>contig</b>.  Note that the concept as formulated covers both incomplete genomes and
148    genomes with multiple replicons.
149    
150    <li>The genes within a genome are of two distinct types:
151    <ol>
152    <li>those that describe how to construct a protein (i.e., protein-encoding genes), and
153    <li>those that describe how to construct a string of RNA (i.e., how to construct a string in the
154    4-character RNA alphabet {A,C,G,U}).
155    </ol>
156    <br><br>
157    <li>The location of a gene is generalized to be a set of regions within the genome (that are
158    concatenated to form the instructions needed to construct either a protein or a string of RNA).
159    <li>A protein is a string in an alphabet that now includes the 20 character codes from
160    the basic abstraction plus a very limited set of extra codes.
161    We already have cases in which <i>selenocysteine</i>  and <i>pyrrolysine</i> appear as nonstandard
162    translations of codons, and there may eventually be more.
163    
164    <li>Each protein-encoding gene has both a DNA sequence (by definition) and a translation.  However,
165    the translation is not required to exactly match what a codon-by-codon translation of the DNA sequence
166    would produce.  This allows us to handle the very rare instances in which selenocysteine occurs as the translation
167    of TGA or pyrrolysine occurs as a translation of TAG (and others, if necessary).
168    </ul>
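The enhanced notions of gene location and translation can be sketched as follows. This is a minimal illustration, not a description of any existing tool: the function names, the 0-based half-open region coordinates, the strand markers, and the <code>overrides</code> mapping (codon index to nonstandard character, such as U for selenocysteine or O for pyrrolysine) are all our assumptions. The codon table is the standard genetic code, with <code>*</code> marking a stop codon.

```python
# Standard genetic code, codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gene_dna(contigs, regions):
    """Concatenate the regions that make up a gene.

    contigs: dict mapping contig id -> DNA string (the genome).
    regions: list of (contig_id, start, end, strand); start/end are 0-based,
             half-open; strand is '+' or '-' (hypothetical conventions).
    """
    parts = []
    for contig_id, start, end, strand in regions:
        piece = contigs[contig_id][start:end]
        if strand == "-":
            # A region on the reverse strand is read as its reverse complement.
            piece = piece.translate(COMPLEMENT)[::-1]
        parts.append(piece)
    return "".join(parts)


def translate(dna, overrides=None):
    """Codon-by-codon translation, with optional per-codon exceptions.

    overrides maps a codon index to the character that actually appears in
    the protein (e.g. {1: 'U'} where TGA is read as selenocysteine).
    """
    overrides = overrides or {}
    protein = []
    for n in range(0, len(dna) - len(dna) % 3, 3):
        index = n // 3
        protein.append(overrides.get(index, CODON_TABLE[dna[n:n + 3]]))
    return "".join(protein)
```

Keeping the exceptions in a separate mapping means the stored translation may differ from the codon-by-codon result without changing the DNA sequence itself.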
169    
170    This loosened up formulation represents a very minimal set of changes.  They should be left out of the
171    basic tutorial for computer scientists and mathematicians.
172    
173    <h2>The cell: Adding the Concepts Needed to Discuss Transcriptional Regulation</h2>
174    
175    In the final version of the abstraction, we add the minimal set of notions needed to support
176    analysis of transcriptional regulation.  An <b>operon</b> is a set of contiguous genes that are all
177    on the same strand and are all co-regulated.  We consider a gene that is not co-regulated with any adjacent genes
178    to be an operon composed of just itself.  A <b>binding site</b> is a small region of DNA (normally
179    occurring a short space ahead of an operon) that acts as a switch turning the operon "on" or "off". When
180    a specific protein or expressed RNA called a <b>transcriptional regulator</b> binds the site, it flips the switch.  One or more
181    specific transcriptional regulators can bind a specific site (i.e., sets of
182     sites are associated with each specific transcriptional regulator).  A regulator binding at a site
183    always has the same effect (either activating or deactivating the operon), but which effect depends on
184    the site-regulator pair.
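The switch model above can be sketched in a few lines. Everything named here is hypothetical (the function <code>apply_bindings</code>, the operon, site, and regulator identifiers, and the "activate"/"repress" labels), and the idea of a default on/off state per operon is our simplifying assumption. The only thing taken from the text is that the effect of a binding is fixed for each site-regulator pair.

```python
def apply_bindings(default_state, site_operon, effects, bound):
    """Compute which operons are on after the given regulators bind.

    default_state: dict operon -> bool (on/off with nothing bound; an assumption).
    site_operon:   dict binding site -> the operon it switches.
    effects:       dict (site, regulator) -> 'activate' or 'repress';
                   fixed for each site-regulator pair.
    bound:         set of (site, regulator) pairs currently bound.
    """
    state = dict(default_state)
    # Sorted only to make the result deterministic if two bindings
    # touch the same operon; the text does not specify that case.
    for site, regulator in sorted(bound):
        effect = effects[(site, regulator)]
        state[site_operon[site]] = (effect == "activate")
    return state
```

For example, a single regulator can activate one operon and repress another, because the effect is looked up per pair rather than per regulator.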
185    
186    <h1>Part 2: Tutorial Notes</h1>
187    
188    <h2>Notes for The Basic Abstraction</h2>
189    
190    We will be speaking about organisms that are a single cell.  At some point life began on earth.
191    The single-celled organisms that we know of replicate producing copies of themselves that have
192    genomes which usually have very, very similar content to that of the parent cell.  <b>Evolution</b> is the
193    process in which cells replicate with some alterations in their genomes, are subjected to
194    <i>selective pressure</i>, and survive or not depending on many somewhat random factors.  The makeup of
195    cells (i.e., the genomes they contain and the machines that define what they are capable of doing)
196    changes gradually (and sometimes not so gradually) as time passes.
197    <p>
198    The original life forms that existed billions of years ago have evolved into three broad categories of
199    life forms.  That is, the evolutionary process led to early divisions, and these led to three main
200    categories of single-celled organisms.  We call these three forms the <b>archaea</b>,
201    the <b>bacteria</b>, and the <b>eukaryotes</b>.
202    A majority of the organisms for which we have acquired complete genomes are from the bacteria,
203    although the
204    numbers are rapidly growing for all three domains.
205    <p>
206    The minimal notion of a cell is enough to explain some of the basic
207  problems in bioinformatics:  problems in bioinformatics:
208    
209  <h3>Identify the genes within a genome</h3>  <h3>Identify the genes within a genome</h3>
210    
211  This problem simply involves taking a genome (a string of DNA) and locating  If we are to understand the contents of genomes, we will need to
212  the set of genes it contains.  Does the existence of 100s of genomes (genomes  locate the genes that occur in each genome.  This problem simply involves taking a genome (a
213  with at least some estimate of where the genes occur) effect how you might do this?  string of DNA) and locating the set of genes it contains.
214    In the case of bacteria and archaea, we know pretty well how to
215    locate the genes.
216    Once we
217    have identified instances from many genomes, it becomes possible to
218    recognize the genes in a new genome by just looking for things similar
219    to those we already understand.  The following problem is at the heart of recognizing when two
220    genes are "similar".
221    
222  <h3>Given two proteins, "align" them in a way that minimizes some edit function.  </h3>  <h3>Given two genes, "align" them in a way that minimizes some edit function.  </h3>
223    
224    For example, here is what you see when you align two genes from distinct organisms:
225    
 For example:  
 <br>  
 <br>  
226  <pre>  <pre>
227    
 seq1            -------------------MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLT  
 seq2            MKIRVICGAPTPKPFIKIPMKYYSTNKQAPLASLEEAVVKGLASDKGLFMPMTIKPLPQE  
                                    ** *. :.:   .*: :**.:**..::***:*  :  :.  
228    
229  seq1            EIDEMLKLDFVTRSAKILSAFIGDEIPQEILEERVRAAFAFPAPVANVESDVGCLELFHG  gene1           ATGGCTGATTTATTCGCATTGACCGAAGAAGCGTTGGCGGGCATGGGCATCGAGTTGGTC
230  seq2            FYDEIENLSFREIAYRVADAFFGEDVPAETLKEIVYDTLNFDVPLVPVKENIYSLELFHG  gene2           ---GTGCAACTGACGGAACTGATAGAAACTACGGTCACGGGGCTCGGCTACGAGCTCGTC
231                    **: :*.*   : :: .**:*:::* * *:* *  :: * .*:. *:.:: .******                     *   *  *    * * ***  ***    ** *  ****  * ***  **** * ***
232    
233  seq1            PTLAFKDFGGRFMAQMLTHIA---GDKPVTILTATSGDTGAAVAHAFYGLPNVKVVILYP  gene1           GATGTCGAACGTGCCGCCTTAGGCTTGTTGCGCGTGACCATAGACCGTGAGGACGGTGTT
234  seq2            PTLAFKDVGGRFMARLLGYFIRKEGRKQVNVLVATSGDTGSAVANGFLGVEGIHVYVLYP  gene2           GATCTCGAGCGCACCGGGCGCGGCATGGTCTGCGTCTACATCGATCAGCCCGCCGGCATC
235                  *******.******::* ::    * * *.:*.*******:***:.* *: .::* :***                  *** **** **  ***     *** ** *  ****   *** ** *     * ***  *
236    
237  seq1            RGKISPLQEKLFCTLGGNIETVAIDGDFDACQALVKQAFDDEELKVALGLNSANSINISR  gene1           CGCATCGAAGATTGTGAGCAGGTGTCCCGGCAATTGTCGCGCGTCTACGAGGTCGAGAAC
238  seq2            KGKVSEIQEKQFTTLGRNITALEVDGTFDDCQALVKAAFMDQELNEQLLLTSANSINVAR  gene2           ACGATCGACGATTGCGAGAAGGTCACGCGTCAGCTCCAGCACGTACTGACGGTCGAAAAC
239                  :**:* :*** * *** ** :: :** ** ****** ** *:**:  * *.******::*                     ***** ***** *** ****  * ** **  *   ** ***      ****** ***
240    
241  seq1            LLAQICYYFEAVAQLPQETRNQ-LVVSVPSGNFGDLTAGLLAKSLGLPVKRFIAATNVND  gene1           ATCGATTACAAACGTCTGGAAGTTGGCTCGCCGGGCGTGGATCGCCCCTTGCGCAACGAG
242  seq2            FLPQAFYYFYAYAQLKKAGRAENVVICVPSGNFGNITAGLFGKKMGLPVRRFIAANNKND  gene2           ATCGATTACGAACGGCTCGAGGTCTCGTCACCGGGGCTCGACCGGCCGTTGAAGAAGCTG
243                  :*.*  *** * *** :  * : :*:.*******::****:.*.:****:*****.* **                  ********* **** ** ** **    ** *****  * ** ** ** ***   **   *
244    
245    gene1           GCGGAATTCCGTCGTTTCGCGGGTGAACGTATCGAGATCAAGCTGCGTGAGGCAGTCGAC
246    gene2           GCTGACTTCACGCGTTTCGCGGGCAGCGAGGCCGTCATCACCCTGAAAAAGCCGTTGGAC
247                    ** ** ***   ***********         **  ****  ***    ** *  * ***
248    
249    gene1           GGGCGCAAAGTGTTTACCGGCATCCTGCAAGAGGCGGACACGTCTGCTGACGATAAGACG
250    gene2           GGGCGCAAGACGTACCGGGGCATTCTGCACGCGCCGAAC------------GGCGAGACG
251                    ********   **     ***** ***** * * ** **            *   *****
252    
253    gene1           GTGTTCGGTCTCGAATTTGAGGCAAAGAAGGACGATATTCAGGTACTGAGCTTCACGCTC
254    gene2           AT---CGGTTTGGAATTTGAGAGGAAGAAGGGCGAGGCGGCCATGCTGGATTTCACGCTG
255                     *   **** * *********   ******* ***        * ***   ********
256    
257    gene1           GATGACATCGAGCGCGCCAAGCTGGATCCCGTTCTGGATTTCAAGGGCAAAAAGCGATGA
258    gene2           GCGGACATCGACAAGGCCCGCCTGATTCCGCACGTTGACTTTAGGAGCCGCAAACAATGA
259                    *  ********    ***   ***  ***     * ** ** * * **   ** * ****
260    </pre>
261    <hr>
262    
263    The sequences are recognizably similar, and in fact implement exactly the same function
264    in the two cells.  If we align the protein sequences corresponding to these two
265    genes, we get
266    
267    <pre>
268    gene1           MADLFALTEEALAGMGIELVDVERAALGLLRVTIDREDGVRIEDCEQVSRQLSRVYEVEN
269    gene2           -VQLTELIETTVTGLGYELVDLERTGRGMVCVYIDQPAGITIDDCEKVTRQLQHVLTVEN
270                      :*  * * :::*:* ****:**:. *:: * **:  *: *:***:*:***.:*  ***
271    
272    gene1           IDYKRLEVGSPGVDRPLRNEAEFRRFAGERIEIKLREAVDGRKVFTGILQEADTSADDKT
273    gene2           IDYERLEVSSPGLDRPLKKLADFTRFAGSEAVITLKKPLDGRKTYRGILHAPNG-----E
274                    ***:****.***:****:: *:* ****..  *.*::.:****.: ***: .:
275    
276    gene1           VFGLEFEAKKDDIQVLSFTLDDIERAKLDPVLDFKGKKR
277    gene2           TIGLEFERKKGEAAMLDFTLADIDKARLIPHVDFRSRKQ
278                    .:***** **.:  :*.*** **::*:* * :**:.:*:
279  </pre>  </pre>
280    
281  shows an alignment of two proteins (called <i>seq1</i> and <i>seq2</i>).  There is a great deal of work relating to recognizing when two sequences are
282    similar and whether or not they had a common ancestor.  Understanding why
283    selective pressure conserves sections of sequences, but not others, will yield
284    important clues.  Can you reason out why some sections might be conserved, while
285    others vary wildly?
286    <p>
287    
288    Comparing sets of sequences that have retained the same function is
289    at the heart of understanding cellular machines and the proteins that implement them.
290    We find that looking at sets (often with more than two sequences) and aligning them
291    is important.
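The pairwise alignment problem stated above can be sketched with the classic dynamic-programming approach to global alignment. This is a minimal sketch with unit costs for mismatches and gaps, which is our assumption; the alignments displayed earlier were produced by a tool with a more refined scoring scheme, so this code will not necessarily reproduce them. The function name <code>align</code> and its parameters are hypothetical.

```python
def align(a, b, mismatch=1, gap=1):
    """Globally align strings a and b, minimizing a simple edit cost.

    Returns (cost, aligned_a, aligned_b), where '-' marks a gap.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cost of aligning a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = cost[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            cost[i][j] = min(substitute, cost[i - 1][j] + gap, cost[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (
            0 if a[i - 1] == b[j - 1] else mismatch
        ):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return cost[n][m], "".join(reversed(top)), "".join(reversed(bottom))
```

The table has (len(a)+1) x (len(b)+1) entries, so time and memory grow with the product of the two lengths; this is one reason aligning thousands of sequences needs different techniques.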
292    
293    
294  <h3> Given a set of sequences, align them in a way that minimizes some edit function.</h3>  <h3> Given a set of sequences, align them in a way that minimizes some edit function.</h3>
295    
# Line 170  Line 366 
366    
367  <h3> Given a multiple sequence alignment, determine the most likely evolutionary history of the sequences (i.e., construct a phylogenetic tree).</h3>  <h3> Given a multiple sequence alignment, determine the most likely evolutionary history of the sequences (i.e., construct a phylogenetic tree).</h3>
368    
369  Here is one reasonable tree for the last 5 sequences.  Note that we now have alignments that  From the extant five sequences that are similar and displayed in the previous alignment, we can construct
370  contain thousands of sequences, and even displaying such trees is nontrivial.  a tree that depicts the "phylogenetic history" of the sequences.
371    Here is one reasonable tree for the last 5 sequences.
372    
373  <pre>  <pre>
374                       ,--------------------------------------------------- seq1                       ,--------------------------------------------------- seq1
375                       |                       |
# Line 183  Line 381 
381    |    |
382    |    |
383    |    |
384    |    ,----|
   |  
   |             ,-------------------------------- seq3  
385    |             |    |             |
386    |             |    |    |             ,-------------------------------- seq3
387    |-------------|    |    |             |
388      |    |             |
389      |    |-------------|
390    |             |    |             |
391    |             |    |             |
392    |             `------------------------------ seq4    |             `------------------------------ seq4
# Line 197  Line 395 
395    `---------------------------------------------- seq5    `---------------------------------------------- seq5
396  </pre>  </pre>
397    
398  This is an <i>unrooted tree</i>, since we have no idea just looking at extant  The tree suggests that at some point an ancestral
399  sequences about where the root should lie.  cell replicated.  One copy led (through a chain of descendants) to <b>seq5</b>, while the remaining sequences descend
400    from the other copy.
401    <p>
402    Note that we now have alignments that
403    contain thousands of sequences, and even displaying such trees is nontrivial.
404    Because evolution plays such a central role in the phenomena we study, the construction of alignments
405    and trees in order to compare extant versions of proteins and gain insight into their historical origins
406    is considered basic to the task at hand.
407    
408  <h2>Some Random Facts that You Should Absorb</h2>  <h3>Some Random Facts that You Should Absorb</h3>
409    
410  Most genomes of bacteria contain between 400,000 and 12,000,000 characters.  Most genomes of bacteria contain between 400,000 and 12,000,000 characters.
411  Normally, the genes in a genome  Normally, the genes in a genome
# Line 215  Line 420 
420  <li>What is the average length of a gene?  <li>What is the average length of a gene?
421  </ul>  </ul>
422  <br>  <br>
423  It is worth spending just a short bit of time thinking about what types of cellular  It is worth spending just a short bit of time thinking about what types of
424  machines must exist.  Here are a few thoughts to start with  machines must exist in each cell.  Here are a few thoughts to start with
425  <ul>  <ul>
426  <li>  <li>
427  There must be one or more machines that support replication of the cell.  You would  There must be one or more machines that support replication of the cell.  You would
# Line 238  Line 443 
443  to react to it.  For example, many cells can "swim" towards food.  to react to it.  For example, many cells can "swim" towards food.
444  </ul>  </ul>
445  Those were just a few examples.  For any cell, we have many, many machines, and we still  Those were just a few examples.  For any cell, we have many, many machines, and we still
446  do not even understand what some of them do.  do not even understand what some of them do.  Later, we will try to offer a more structured
447    estimate of what is already known.
448  <p>  <p>
449  About 50-60% of the genes occur within 5000 characters of another gene such that  About 50-60% of the genes occur within 5000 characters of another gene such that
450  the two genes encode proteins that are part f the same cellular machine.  If you  the two genes encode proteins that are part of the same cellular machine.  This fact
451  had a genome in which the genes were identified, but the correspondence between the encoded  suggests that just having a large number of genomes would enable a person to group
452  proteins and cellular machines was completely unknown, what could you learn using this fact?  the genes into the machines they implement, without the person understanding the functions
453  Is the situation significantly different if you have 1000 genomes (let us say that  of the machines or the roles played by each protein.
 you know where the genes occur, but the correspondence between the proteins and cellular machines  
 is completely unknown in each case).  
454  <p>  <p>
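The idea that many genomes let us group genes into machines by proximity alone can be sketched as a simple count. This is an illustration under assumptions of our own: each gene is given as a (family, contig, position) triple, genes in different genomes are matched by a protein-family label, and distance is measured between single coordinates. The function name and the 5000-character default simply echo the figure quoted above.

```python
from collections import Counter
from itertools import combinations


def colocated_family_pairs(genomes, max_gap=5000):
    """Count, over all genomes, how often two families have nearby genes.

    genomes: list of genomes; each genome is a list of
             (family, contig, position) triples (hypothetical layout).
    Returns a Counter mapping frozenset({family1, family2}) to the number
    of genomes in which genes of the two families lie within max_gap
    characters of each other on the same contig.
    """
    counts = Counter()
    for genes in genomes:
        seen = set()
        for (fam1, contig1, pos1), (fam2, contig2, pos2) in combinations(genes, 2):
            if contig1 == contig2 and fam1 != fam2 and abs(pos1 - pos2) <= max_gap:
                seen.add(frozenset((fam1, fam2)))
        # Each pair is counted at most once per genome.
        counts.update(seen)
    return counts
```

Pairs that recur across many genomes are candidates for belonging to the same machine, even when nothing is known about what the machine does.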
455  Occasionally, proteins that are usually distinct in most cells are fused into a single protein in  Occasionally, proteins that are usually distinct in most cells are fused into a single protein in
456  a few cells.  In these cases, the fused gene is (by definition) part of a single machine, and  a few cells.  In these cases, the fused gene is (by definition) part of a single machine, and
457  in most cells in which the proteins are not fused, the two distinct proteins are separate components  in most cells in which the proteins are not fused, the two distinct proteins are separate components
458  of a single machine.  How wuld you go about locating fused genes, and what could you learn from them?  of a single machine.  This, too, offers clues to support analysis of which proteins go with which machines.
459  <p>  <p>
460  Biologists have figured out the roles of about 50% of the genes.  That is, they can  Biologists have figured out the roles of about 50% of the genes.  That is, they can
461  place the gene in a cellular machine, they know what the machine does, and they know  place the gene in a cellular machine, they know what the machine does, and they know
462  the specific role of the gene in sustaining the functionality of the machine.  the specific role of the gene in sustaining the functionality of the machine.
463  <br><br>  <br><br>
464    
465  <h2>Imposing a Structure on Characterizing the Inventory</h2>  <h3>Imposing a Structure on Characterizing the Inventory</h3>
466    
467  One central goal of bioinformatics is to support an accurate characterization of the cellular  One central goal of bioinformatics is to support an accurate characterization of the cellular
468  machinery for each cell.  It is of major importance to biologists that we be able to support  machinery for each cell.  It is of major importance to biologists that we be able to support
# Line 317  Line 521 
521  handled by fairly general procedures.  handled by fairly general procedures.
522  </ul>  </ul>
523    
524  <h2>Microarrays and States of the Cell</h2>  <h3>States of the Cell</h3>
525    
526    The notion of <i>subsystem</i> was introduced as an <i>abstract machine</i> -- that is, as an
527    attempt to create a framework for understanding variations within specific cellular machines via
528    a form of comparative analysis.
529    
530    In any specific cell, sets of specific cellular machines are
531    switched on and off as units.  That is, they are <i>co-regulated</i>.  We will call such a set
532    of <i>co-regulated cellular machines</i> a <b>regulon</b> (note that a regulon is often a set containing
533    a single cellular machine).  A <b>state</b> of a cell will be defined
534    as the set of regulons that are operational at a point in time.  Thus, a state amounts to the set
535    of cellular machines that are operational at one instant.
536    <p>
537    If we think of a car as a bag of machines that interact to make it function, we might consider there
538    to be a huge number of states.  There are many very minor "machines" like the arm rest (or the radio, or the night light) that can be on or off.  However, we can divide the states of a car into major groupings based on the status
539    of some key "machines".  For example, "off" (the state in which the engine is turned off and the car is parked) and
540    "on" (the engine is running and the car is moving) might be viewed as a crude partitioning of the states into
541    two "major states".
542    <p>
543    Similarly, we believe that we should think about <i>major states of the cell</i> as being determined by the
544    functioning (or not) of a limited set of regulons.  The determination of these regulons, the major states,
545    and how transitions between them are managed are all now parts of the picture being filled in.
546    
547    
548    <h3>Microarrays</h3>
549    
550    Microarrays are, for a given genome, two lists of genes that "changed expression levels" between two states of a
551    cell.  Basically, the first list contains genes that were "active" during the first state, but not the second; and the
552    second list contains genes that were "active" in the second but not the first.  If a cellular
553    machine utilizes protein <i>X</i>, and <i>X</i> is in the first list, and if <i>X</i> is used in
554    only one cellular machine, then it would be reasonable to infer that the machine was
555    active in the first state, but not the second.  If one knew the regulons for a specific cell, it would go
556    a long way to support extraction of insights from these microarrays.  On the other hand, if one had many,
557    many microarrays, and if the specific cellular machines for the cell are known, then one could make
558    substantial progress in uncovering the exact composition of the regulons that make up the cell.
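The inference rule described above (a protein used by only one machine tells us when that machine was active) can be sketched as follows. The function name, the dictionary layout, and the "first"/"second" labels are hypothetical, and leaving a machine unclassified when its evidence is missing or contradictory is our assumption rather than anything the text specifies.

```python
from collections import Counter


def infer_machine_activity(machines, first_only, second_only):
    """Infer the state in which each machine was active.

    machines:    dict machine name -> set of genes whose proteins it uses.
    first_only:  genes "active" in the first state but not the second.
    second_only: genes "active" in the second state but not the first.
    Returns a dict machine -> 'first' or 'second' for machines that can be
    classified; machines with no or conflicting evidence are omitted.
    """
    # Number of machines that use each gene.
    usage = Counter(gene for genes in machines.values() for gene in genes)
    result = {}
    for machine, genes in machines.items():
        # Only genes used by exactly one machine count as evidence.
        unique = {gene for gene in genes if usage[gene] == 1}
        in_first = unique & first_only
        in_second = unique & second_only
        if in_first and not in_second:
            result[machine] = "first"
        elif in_second and not in_first:
            result[machine] = "second"
    return result
```

A gene shared by several machines is ignored here, since its change in expression cannot be attributed to any one of them.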
559    
560    <h2>Notes for the Enhanced Abstraction</h2>
561    
562    The process of <b>expressing a gene</b> amounts to using the gene to produce the functional component of
563    a machine (a protein for a protein-encoding gene, and an RNA for an RNA-encoding gene).
564    The process of expressing a protein-encoding gene, which takes a gene (a string of DNA formed by concatenating a sequence of
565    regions from contigs) and produces a protein, is normally thought of as taking place in two steps.
566    <b>Transcription</b> is the process of a specific machine moving along the contig and making a copy of the
567    gene as RNA.  This string of RNA is then <b>translated</b> by a separate machine.  The machine that performs
568    the copying of the gene into a string of RNA is called an <b>RNA polymerase</b>.  The machine to translate
569    the RNA into a protein, the <b>ribosome</b>, is made up of both proteins and RNA components.
570    <p>
571    Machines can be made up of both protein and RNA components, although most machines are built from
572    just proteins. Some of the most fundamental questions in biology relate to how life started and the steps
573    required to gradually enrich the basic machinery to the point where this magnificent information storage and
574    maintenance system based on DNA, RNA and proteins could have arisen.  There is much that can be inferred by
575    reasoning back from what we now observe and reasoning forward from the relatively little we know of
576    what the early earth was like. One possible set of goals would be to first understand in detail the inventory
577    of components we now see in life forms, composing something analogous to a CAD/CAM system describing life forms.
578    Then, as a second step, to understand the sequence of transformations that led from some initial raw components
579    to initial life forms to those we have seen and characterized.
580    <p>
581    The need to allow occasional "nonstandard" characters in protein sequences and a loosening of the correspondence
582    between a gene and characters in the protein sequence it can be used to build results from the fact that
583    evolution has produced the existing genetic codes and they continue to evolve (either converging or diverging
584    depending on the outcome of basically random processes operating under selective pressure).
585    <br>
586    
587    <h2>Notes on the Abstraction Extended to Support Regulation</h2>
588    
589    There are two basically different regulatory mechanisms in the cell.  In one, you have a metabolic
590    network in which fluxes are tightly controlled by positive and negative feedback loops. This <b>metabolic
591    regulation</b> occurs very rapidly.  <b>Transcriptional regulation</b> occurs orders of magnitude more
592    slowly.  It is just this transcriptional regulation that we consider in this extension.
593    <p>
594    
595    As the cell changes state, regulons are activated or de-activated by
596    transcriptional regulators (either protein or RNA) binding to specific
597    sites in the DNA.  This model has the redeeming characteristic of
598    simplicity.  It is certainly the case that there are innumerable
599    important issues that it disregards (e.g., regulation based on DNA
600    packaging, due to small RNAs binding the RNAs produced by
601    transcription, etc.).  In forming any clear notion of transcriptional
602    regulation and how it is achieved, we will need to carefully separate
603    these different mechanisms, since they have fundamentally different
604    modes of control and operation.  We are arguing that the notion of a
605    protein or RNA being used to flip regulons on and off by binding to
606    control sites within the genome is a major form of regulation and
607    probably the right place to start any effort to formulate a useful
608    abstraction.
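<p>
The simple model above, in which a regulator binds a control site and flips a regulon on or off, can be sketched as follows. This is a minimal sketch under stated assumptions: it treats the state of the cell as the set of regulons that are currently operational, it ignores the other mechanisms mentioned (DNA packaging, small RNAs), and every class, field, and identifier in it is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Regulator:
    name: str        # a protein or RNA (hypothetical identifier)
    regulon: str     # the regulon whose control site it binds
    activates: bool  # True: binding switches the regulon on; False: off


@dataclass
class Cell:
    # Assumption: the state of the cell is the set of active regulons.
    active_regulons: set = field(default_factory=set)

    def bind(self, regulator):
        """Apply the effect of a regulator binding its control site."""
        if regulator.activates:
            self.active_regulons.add(regulator.regulon)
        else:
            self.active_regulons.discard(regulator.regulon)

    def state(self):
        """Return an immutable snapshot of the current state."""
        return frozenset(self.active_regulons)


cell = Cell(active_regulons={"housekeeping"})
cell.bind(Regulator("activator_X", "heat_shock", activates=True))
state_after_activation = cell.state()
cell.bind(Regulator("repressor_Y", "housekeeping", activates=False))
state_after_repression = cell.state()
```

Modeling a state as a plain set keeps transitions between states easy to compare: the difference between two snapshots is exactly the set of regulons that were switched.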
609    
610    <h1>The Role of Bioinformatics in Supporting the Genomic Revolution</h1>
611    
612    Within the growing genomics revolution, one can easily divide developments and
613    goals into those relating to advances in medicine and agriculture and those relating to
614    pure science.  Here we consider only issues relating to pushing advances in basic research.
615    Here is an overview of our perspective:
616    <ol>
617    <li> The different life forms that now exist were produced by an evolutionary process,
618    which leads to our view that comparative analysis is the key to understanding.  Biological
619    machines that exist in complex forms will often also still exist in simpler forms (usually
620    in simpler organisms).
621    <li> Unravelling exactly how a machine works is more easily done in simpler organisms.  They
622    are easier to work with, and it is easier to gather the data needed to support comparative analysis.
623    
624    <li> This leads to the view that we should try to understand single-celled organisms to lay
625    the foundation for analysis of multicellular organisms.
626    
627    <li> The characterization of unicellular life will require access to orders of magnitude
628    more data than exist now (we have more-or-less complete genomes for about 1000 organisms, but
629    that represents a small fraction of a percent of extant single-celled life forms).
630    
631    <li> The immediate basic steps that are taking place are roughly:
632    <ol>
633    <li> Attempt to formulate a growing list of abstract machines that correspond
634    to the many specific machines that implement the same goal.  These abstract machines (subsystems)
635    represent the basic units that make up life forms.
636    
637    <li> Create protein and RNA families in which the members are all homologous (share a common ancestor),
638    remain similar over almost all of the sequence, and all implement a common function.
639    
640  We will think of a <b>regulon</b> as a set of subsystems.  A <b>state of the cell</b> is  <li> Build alignments for each protein family, along with phylogenetic trees that represent
641  defined as the set of regulons that are operational at a point in time.  an estimate of the history of how these specific sequences evolved.
642    
643    <li>Provide a computational framework to support continued maintenance and development of these
644    basic data types.
645    </ol>
646    
647    <li> A limited number of groups have progressed to the point where they can create models of
648    an organism that display predictive capabilities.  There are many forms of modeling.  In our view
649    it is important that we reach the state where we can routinely model states of the cell, transitions
650    between states, and metabolic characteristics of the cell.  We believe that it is now possible
651    to create fairly comprehensive representations of the metabolic networks of some bacteria.
652    In these cases, we have substantial amounts of physiological data, the number of abstract machines
653    in the cell is fairly limited, and it is possible to compare the predictions against observed results.
654    
655    
656    </ol>
657    
658    
659    <br><br>
660    We do not actually know what machines are present in a cell.  We are in the midst of a grand
661    effort to clarify which are there and what they do.  Reaching a point where we have a near
662    complete overview of the basic inventory is arguably the highest priority at this point (we ignore
663    the medical revolution and numerous other wonderful advances, but...).
664    
665    The formulation of subsystems as abstract machines,
666    in which each row of the subsystem describes a specific cellular machine that is believed to be present,
667    represents a way to maintain a collection of estimates or assertions.
668    <p>
669    A <b>protein family</b> is defined to be a set of proteins that implement the same functional roles and
670    are similar over the entire lengths of the proteins.
671  <p>  <p>
672  A <b>consistent microarray</b> (for our purposes) is  We seek a situation in which each protein occurs in one or more subsystems and in a single protein family.
673    The computational tasks imposed by such a goal are obvious:
674    <ul>
675    <li>We need to construct databases that implement at least the following entities:
676  <ol>  <ol>
677  <li> The ID of an experiment.  The experiment corresponds to two states of the cell, S1 and S2.  <li>cells (i.e., each cell must have an ID and a set of attributes),
678  <li> A list of proteins that are in the regulons in S1, but not in those of S2.  <li>genomes,
679  <li> A list of proteins that are in the regulons of S2, but not in those of S1.  <li>genes,
680    <li>proteins,
681    <li>functional roles,
682    <li>subsystems, and
683    <li>protein families.
684  </ol>  </ol>
685  <br>  <li> We need to add support for developing clues to function by integrating data
686  A <b>real microarray</b> is just two sets of proteins.  We have some notion (e.g., an ID) for  from sources like proximity within the genome, fusions, etc.
687  each of two states of the cell, but no idea what regulons make up these states.  There is a  <li>We need to support a framework for the development of populated subsystems.
688  substantial error rate in the two lists of proteins (e.g., some of the proteins in the first list either were not in  <li>We need to construct decision procedures for membership in protein families.  Some
689  S1 or they were in S2).  of these procedures will be quite complex, although the majority of cases can be
690  <p>  handled by fairly general procedures.
691  The interesting research question is, given a large list of real microarrays, can you  </ul>
692  attach sets of regulons to all of the state IDs, and then give a minimal set of changes to the data  
 in the real microarrays needed to convert them to consistent microarrays.  
693    
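<p>
The goal stated above (each protein occurs in one or more subsystems and in a single protein family) can be checked mechanically once the entities exist. The following is a minimal sketch, assuming in-memory records rather than a real database; it covers only proteins, subsystems, and protein families (cells, genomes, genes, and functional roles as separate entities are omitted for brevity), and all class, field, and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Protein:
    id: str
    gene_id: str
    functional_roles: set = field(default_factory=set)


@dataclass
class Subsystem:
    name: str
    functional_roles: set = field(default_factory=set)
    protein_ids: set = field(default_factory=set)


@dataclass
class ProteinFamily:
    id: str
    protein_ids: set = field(default_factory=set)


def coverage_problems(proteins, subsystems, families):
    """Map each protein id that misses the goal to a list of its problems."""
    problems = {}
    for protein in proteins:
        n_subsystems = sum(protein.id in s.protein_ids for s in subsystems)
        n_families = sum(protein.id in f.protein_ids for f in families)
        issues = []
        if n_subsystems == 0:
            issues.append("in no subsystem")
        if n_families != 1:
            issues.append(f"in {n_families} protein families (expected 1)")
        if issues:
            problems[protein.id] = issues
    return problems


proteins = [
    Protein("p1", "g1", {"roleA"}),
    Protein("p2", "g2", {"roleB"}),
    Protein("p3", "g3"),
]
subsystems = [Subsystem("ss1", {"roleA", "roleB"}, {"p1", "p2"})]
families = [ProteinFamily("fam1", {"p1"}), ProteinFamily("fam2", {"p1", "p2"})]
problems = coverage_problems(proteins, subsystems, families)
```

Here only <code>p2</code> meets the goal: <code>p1</code> sits in two families, and <code>p3</code> is in no subsystem and no family.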
694    <h1> The Role of Abstraction in Setting the Stage for Software Development and Modeling</h1>

