Annotation of /FigTutorial/tut_abs.html

Revision 1.5

1 : overbeek 1.5 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
2 :     <html><head><title>Abstraction Working Document</title>
3 :    
4 :     </head>
5 :     <body>
6 :     <div align="center">
7 :     <h1>The Role of Bioinformatics in Interpreting Genomes of
8 :     Unicellular Organisms:</h1>
9 : overbeek 1.1 <h1>An Abstract View</h1>
10 : overbeek 1.5 <h2>by Ross Overbeek, ...</h2>
11 : overbeek 1.1 </div>
12 : overbeek 1.4 <h2>Introduction</h2>
13 : overbeek 1.5 This strange document began as a tutorial for computer scientists and
14 :     mathematicians. It was supposed
15 :     to somehow introduce them to the computational issues in genome
16 :     analysis.
17 :     It was requested by an instructor in a computer class. Overbeek, in
18 :     attempting to respond to this request,
19 :     formulated an abstraction that he began to believe had significance
20 :     beyond the tutorial.
21 :     <p>This document is a set of working notes relating to the
22 :     abstraction. It is not organized properly as
23 :     an abstraction, a tutorial, or an essay on the role of bioinformatics
24 :     in support of biological research. It is,
25 :     however, organized properly as a working document that relates to all
26 :     of these goals.
27 :     </p>
28 :     <p>It begins with a development of the abstraction. This will be
29 :     suitable for mathematicians or computer scientists.
30 :     The abstraction is developed in three steps: the basic abstraction, the
31 :     enhanced abstraction needed to provide
32 :     basic bioinformatics support for biologists, and finally a third step
33 :     that adds support for the notion
34 :     of regulation. The intent throughout this discussion will be to seek a
35 :     minimal set of concepts needed to
36 :     effectively capture the essence of the required data. Unlike almost all
37 :     efforts to lay a foundation
38 :     for tutorials, software or research in biology, this effort focuses on
39 :     leaving out as much as possible.
40 :     While we do believe that there is an almost unlimited complexity that
41 :     can be introduced, and almost all of
42 :     it is needed for some specific goals, the vast majority of tools and
43 :     discussions require (we believe) relatively few
44 :     concepts. As they say, "the proof is in the pudding."
45 :     </p>
46 :     <p>The second section will feature somewhat more tutorial commentary.
47 :     It may well repeat much of what is in Part 1.
48 :     This part is offered as a way of easing a computer scientist or
49 :     mathematician into the issues that need to be
50 :     considered, if they wish to try to do useful research relating to the
51 :     genomics revolution. Eventually, this part
52 :     will be dramatically expanded by giving condensed summaries of the
53 :     machines of the cell broken into two broad
54 :     sets: the metabolic network and the cellular machinery not directly
55 :     included in the metabolic network. Loosely,
56 :     this separates what would be learned in a microbial biochemistry class
57 :     (when they exist) from what would
58 : overbeek 1.4 be learned in a course on molecular biology.
59 : overbeek 1.5 </p>
60 :     <p>The third part is an essay that attempts to characterize our
61 :     view on </p>
62 : overbeek 1.4 <ul>
63 : overbeek 1.5 <li> what the main goals should be in current efforts to
64 :     advance biological knowledge via genome research,
65 :     </li>
66 :     <li> what role bioinformatics researchers have played in the
67 :     past, and
68 :     </li>
69 :     <li> what role they could productively play during the coming
70 :     few years.
71 :     </li>
72 : overbeek 1.4 </ul>
73 : overbeek 1.5 As such, it is undoubtedly an arrogant formulation by a group of
74 :     individuals with minimal background in
75 : overbeek 1.4 biology.
76 : overbeek 1.5 <p>The fourth section will focus on the implications of the
77 :     abstractions in software development.
78 :     This is a bit of a radical proposal that makes sense to us (and is in
79 :     an area that we can
80 : overbeek 1.4 legitimately claim expertise).
81 : overbeek 1.5 </p>
82 : overbeek 1.4 <h1>Part 1: The Abstractions</h1>
83 :     <h2>The cell: a Minimal Perspective</h2>
84 : overbeek 1.5 A <b>cell</b> is a bag (i.e., a volume enclosed by a
85 :     membrane) that contains three types of things: compounds, cellular
86 :     machines, and a genome.
87 :     <p>By the term <b>compound</b> we refer to the
88 :     normal notion of chemical compound. </p>
89 :     <p>A <b>cellular machine</b> is a set of proteins
90 :     that together perform a function. Unless otherwise noted,
91 :     when we use the term <i>machine</i> we will always be
92 :     speaking of a cellular machine.
93 : overbeek 1.3 Many machines
94 : overbeek 1.5 transform one set of compounds into another set. Some machines
95 :     (transport machines) are used to move compounds into
96 :     or out of the cell. Later we will try to convey a more comprehensive
97 :     notion of what functions are implemented
98 : overbeek 1.3 by machines that we understand.
99 : overbeek 1.5 </p>
100 :     <p>A <b>protein</b> is a string of amino acids
101 :     (i.e., a string in the 20-character alphabet
102 :     {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}).
103 :     </p>
104 :     <p>A <b>genome</b> is a string of DNA bases (i.e., a
105 :     string in the 4-character alphabet {A,C,G,T}).
106 :     </p>
107 :     <p>A <b>gene</b> is a region in the genome that
108 :     describes how to build a
109 :     protein. The description is a sequence of 3-character codons. Each
110 : overbeek 1.1 codon corresponds to either a single amino acid or a stop codon.
111 : overbeek 1.5 There are three stop codons: {TAA,TAG,TGA}. The genetic code is the
112 : overbeek 1.1 table of correspondences between codons and amino acids:
113 : overbeek 1.5 <br>
114 :     <br>
115 :     <table border="1">
116 :     <tbody>
117 :     <tr>
118 :     <th>Amino Acid</th>
119 :     <th>Codons</th>
120 :     </tr>
121 :     <tr>
122 :     <td>A</td>
123 :     <td>GCT, GCC, GCA, GCG </td>
124 :     </tr>
125 :     <tr>
126 :     <td>C</td>
127 :     <td>TGT, TGC</td>
128 :     </tr>
129 :     <tr>
130 :     <td>D</td>
131 :     <td>GAT, GAC</td>
132 :     </tr>
133 :     <tr>
134 :     <td>E</td>
135 :     <td>GAA, GAG</td>
136 :     </tr>
137 :     <tr>
138 :     <td>F</td>
139 :     <td>TTT, TTC</td>
140 :     </tr>
141 :     <tr>
142 :     <td>G</td>
143 :     <td>GGT, GGC, GGA, GGG</td>
144 :     </tr>
145 :     <tr>
146 :     <td>H</td>
147 :     <td>CAT, CAC</td>
148 :     </tr>
149 :     <tr>
150 :     <td>I</td>
151 :     <td>ATT, ATC, ATA</td>
152 :     </tr>
153 :     <tr>
154 :     <td>K</td>
155 :     <td>AAA, AAG</td>
156 :     </tr>
157 :     <tr>
158 :     <td>L</td>
159 :     <td>TTA, TTG, CTT, CTC, CTA, CTG</td>
160 :     </tr>
161 :     <tr>
162 :     <td>M</td>
163 :     <td>ATG</td>
164 :     </tr>
165 :     <tr>
166 :     <td>N</td>
167 :     <td>AAT, AAC</td>
168 :     </tr>
169 :     <tr>
170 :     <td>P</td>
171 :     <td>CCT, CCC, CCA, CCG</td>
172 :     </tr>
173 :     <tr>
174 :     <td>Q</td>
175 :     <td>CAA, CAG</td>
176 :     </tr>
177 :     <tr>
178 :     <td>R</td>
179 :     <td>CGT, CGC, CGA, CGG, AGA, AGG</td>
180 :     </tr>
181 :     <tr>
182 :     <td>S</td>
183 :     <td>TCT, TCC, TCA, TCG, AGT, AGC</td>
184 :     </tr>
185 :     <tr>
186 :     <td>T</td>
187 :     <td>ACT, ACC, ACA, ACG</td>
188 :     </tr>
189 :     <tr>
190 :     <td>V</td>
191 :     <td>GTT, GTC, GTA, GTG</td>
192 :     </tr>
193 :     <tr>
194 :     <td>W</td>
195 :     <td>TGG</td>
196 :     </tr>
197 :     <tr>
198 :     <td>Y</td>
199 :     <td>TAT, TAC</td>
200 :     </tr>
201 :     <tr>
202 :     <td>*</td>
203 :     <td>TAG, TGA, TAA [Stop codons]</td>
204 :     </tr>
205 :     </tbody>
206 : overbeek 1.1 </table>
207 : overbeek 1.5 <br>
208 :     <br>
209 :     </p>
210 :     <hr>The process of building a protein as a string of amino acids
211 :     from the gene containing codons is
212 : overbeek 1.4 called <b>expressing</b> the gene.
213 :     <br>
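<p>As a rough illustration of what "expressing" a gene means in this minimal abstraction, here is a small Python sketch. The codon assignments are copied from the table above; the names <code>CODONS_BY_AMINO_ACID</code>, <code>CODON_TABLE</code>, and <code>express</code> are our own and purely illustrative. The sketch assumes the gene is given as an uppercase DNA string that starts at the first codon.</p>

```python
# Genetic code as listed in the table above; "*" marks the three stop codons.
CODONS_BY_AMINO_ACID = {
    "A": "GCT GCC GCA GCG", "C": "TGT TGC", "D": "GAT GAC", "E": "GAA GAG",
    "F": "TTT TTC", "G": "GGT GGC GGA GGG", "H": "CAT CAC", "I": "ATT ATC ATA",
    "K": "AAA AAG", "L": "TTA TTG CTT CTC CTA CTG", "M": "ATG", "N": "AAT AAC",
    "P": "CCT CCC CCA CCG", "Q": "CAA CAG", "R": "CGT CGC CGA CGG AGA AGG",
    "S": "TCT TCC TCA TCG AGT AGC", "T": "ACT ACC ACA ACG",
    "V": "GTT GTC GTA GTG", "W": "TGG", "Y": "TAT TAC", "*": "TAG TGA TAA",
}

# Invert the table: codon -> one-letter amino acid code (or "*" for stop).
CODON_TABLE = {
    codon: amino_acid
    for amino_acid, codons in CODONS_BY_AMINO_ACID.items()
    for codon in codons.split()
}


def express(gene):
    """Translate a gene codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(gene) - 2, 3):
        amino_acid = CODON_TABLE[gene[i:i + 3]]  # KeyError on non-ACGT input
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    # The first eight codons of gene1 from the tutorial notes, plus a stop codon.
    print(express("ATGGCTGATTTATTCGCATTGACCTGA"))
```

<p>The enhanced abstraction later relaxes the assumption that a stored translation always equals this codon-by-codon result.</p>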
214 : overbeek 1.5 A <b>subsystem</b> (i.e., an abstract cellular machine) is
215 :     a set of functional roles.
216 :     Each protein implements one or more functional roles. The set of
217 :     functional roles
218 :     implemented by the protein is called the <b>function of the
219 :     protein</b>. The function of a multifunctional
220 :     protein that implements {functional-role-1,functional-role-2} is
221 :     normally written as
222 : overbeek 1.4 <i>functional-role-1 / functional-role-2</i>.
223 : overbeek 1.5 <br>
224 :     <br>
225 :     A <b>populated subsystem</b> is a subsystem with an
226 :     attached spreadsheet. Each column
227 :     in the spreadsheet corresponds to a functional role in the subsystem,
228 :     and each row corresponds to
229 :     a specific genome. Each cell in the spreadsheet contains the genes from
230 :     the corresponding genome
231 :     that implement the designated functional role (there may be 0 or more
232 :     such genes).
233 :     <br>
234 :     <br>
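<p>One possible way to represent a populated subsystem in code is sketched below. This is only an illustration of the spreadsheet described above, not a description of any existing software; the class name, method names, and the genome, role, and gene identifiers are all hypothetical.</p>

```python
from collections import defaultdict


class PopulatedSubsystem:
    """A subsystem (a set of functional roles) with its attached spreadsheet."""

    def __init__(self, roles):
        self.roles = list(roles)  # one spreadsheet column per functional role
        # genome -> role -> set of genes (one spreadsheet row per genome)
        self.rows = defaultdict(lambda: defaultdict(set))

    def add_gene(self, genome, role, gene):
        if role not in self.roles:
            raise KeyError(f"{role!r} is not a role in this subsystem")
        self.rows[genome][role].add(gene)

    def cell(self, genome, role):
        """Genes from `genome` that implement `role` (0 or more)."""
        return set(self.rows.get(genome, {}).get(role, set()))

    def missing_roles(self, genome):
        """Roles for which the genome's row has an empty cell."""
        return [role for role in self.roles if not self.cell(genome, role)]


if __name__ == "__main__":
    subsystem = PopulatedSubsystem(["functional-role-1", "functional-role-2"])
    subsystem.add_gene("genome-A", "functional-role-1", "gene-17")
    print(subsystem.cell("genome-A", "functional-role-1"))
    print(subsystem.missing_roles("genome-A"))
```

<p>Empty cells are kept explicit because a missing role in a row is itself informative: it is either a gap in the estimate or evidence that the machine differs in that genome.</p>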
235 :     We do not actually know what machines are present in a cell. We are in
236 :     the midst of a grand
237 :     effort to clarify which are there and what they do. The formulation of
238 :     subsystems as abstract machines
239 :     in which each row of the subsystem describes a specific cellular
240 :     machine that is believed to be present,
241 : overbeek 1.4 represents a way to maintain a collection of estimates or assertions.
242 : overbeek 1.5 <p>A <b>protein family</b> is defined to be a set of
243 :     proteins that implement the same functional roles and
244 : overbeek 1.4 are similar over the entire lengths of the proteins.
245 : overbeek 1.5 </p>
246 :     <p>We seek a situation in which each protein occurs in one or
247 :     more subsystems and in a single protein family.
248 :     </p>
249 :     <p>In any specific cell, sets of specific cellular machines are
250 :     switched on and off as units. That is, they are <i>co-regulated</i>.
251 :     We will call such a set
252 :     of <i>co-regulated cellular machines</i> a <b>regulon</b>
253 :     (note that a regulon is often a set containing
254 :     a single cellular machine). A <b>state</b> of a cell will
255 :     be defined
256 :     as the set of regulons that are operational at a point in time. Thus, a
257 :     state amounts to the set
258 : overbeek 1.4 of cellular machines that are operational at one instant.
259 : overbeek 1.5 </p>
260 :     <p>A microarray experiment yields, for a given genome, two lists of genes that
261 :     "changed expression levels" between two states of a
262 :     cell. Basically, the first list contains genes that were "active" during
263 :     the first state, but not the second; and the
264 :     second list contains genes that were "active" in the second but not the
265 :     first. If a cellular
266 :     machine utilizes protein <i>X</i>, and <i>X</i>
267 :     is in the first list, and if <i>X</i> is used in
268 :     only one cellular machine, then it would be reasonable to infer
269 :     that the machine was
270 : overbeek 1.4 active in the first state, but not the second.
271 : overbeek 1.5 </p>
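<p>The inference rule in the preceding paragraph can be written down directly. The sketch below is a minimal illustration under the simplifying assumption that each protein is identified with the gene that encodes it; the function name and the machine and gene names in the example are made up.</p>

```python
def machines_active_only_in_first(machines, first_list):
    """Infer which machines were active in the first state but not the second.

    machines:   dict mapping machine name -> set of genes whose proteins it uses
    first_list: genes that were "active" in the first state but not the second
    """
    # Record, for every gene, the machines that use its protein.
    users_of = {}
    for machine, genes in machines.items():
        for gene in genes:
            users_of.setdefault(gene, set()).add(machine)

    inferred = set()
    for gene in first_list:
        users = users_of.get(gene, set())
        # Only a gene used by exactly one machine supports the inference.
        if len(users) == 1:
            inferred |= users
    return inferred


if __name__ == "__main__":
    machines = {
        "machine-1": {"geneX", "geneY"},
        "machine-2": {"geneY", "geneZ"},
    }
    print(machines_active_only_in_first(machines, ["geneX", "geneY"]))
```

<p>In the example, geneY says nothing on its own because it is shared by two machines; geneX is what licenses the conclusion about machine-1.</p>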
272 :     <h2>The cell: the Enhanced Formulation Needed to Support
273 :     Bioinformatics</h2>
274 :     In the enhanced abstraction, we need to loosen up some concepts. In
275 :     particular,
276 : overbeek 1.4 <ul>
277 : overbeek 1.5 <li> A <b>genome</b> is a set of strings in a
278 :     4-character alphabet. Each of the strings
279 :     is called a <b>contig</b>. Note that the concept as
280 :     formulated covers both incomplete genomes and genomes with multiple
281 :     replicons.
282 :     </li>
283 : overbeek 1.4 <li>The genes within a genome are of two distinct types:
284 :     <ol>
285 : overbeek 1.5 <li>those that describe how to construct a protein (i.e.,
286 :     protein-encoding genes), and
287 :     </li>
288 :     <li>those that describe how to construct a string of RNA
289 :     (i.e., how to construct a string in the
290 : overbeek 1.4 4-character RNA alphabet {A,C,G,U}).
291 : overbeek 1.5 </li>
292 : overbeek 1.4 </ol>
293 : overbeek 1.5 <br>
294 :     <br>
295 :     </li>
296 :     <li>The location of a gene is generalized to be a set of
297 :     regions within the genome (that are
298 :     concatenated to form the instructions needed to construct either a
299 :     protein or a string of RNA).
300 :     </li>
301 :     <li>A protein is a string of characters in an alphabet that now includes
302 :     the 20 character codes from
303 :     the basic abstraction plus a very limited set of extra codes. We
304 :     already have cases in which <i>selenocysteine</i> and <i>pyrrolysine</i>
305 :     appear as nonstandard translations of codons, and there may eventually
306 :     be more.
307 :     </li>
308 :     <li>Each protein-encoding gene has both a DNA sequence (by
309 :     definition) and a translation. However,
310 :     the translation is not required to exactly match what a codon-by-codon
311 :     translation of the DNA sequence
312 :     would produce. This allows us to handle the very rare instances in
313 :     which selenocysteine occurs as the translation
314 :     of TGA or pyrrolysine occurs as a translation of TAG (and others, if
315 :     necessary).
316 :     </li>
317 : overbeek 1.4 </ul>
318 : overbeek 1.5 This loosened up formulation represents a very minimal set of changes.
319 :     They should be left out of the
320 : overbeek 1.4 basic tutorial for computer scientists and mathematicians.
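<p>To make the enhanced formulation concrete, here is a small sketch of a genome as a set of named contigs and a gene whose location is several regions that are concatenated. The class and field names are hypothetical, coordinates are assumed to be 0-based with an exclusive end, and the regions are kept in an ordered list because the order of concatenation matters. Strand is deliberately left out, since it is not part of the abstraction at this step.</p>

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Region:
    contig: str  # name of the contig the region lies on
    start: int   # 0-based start position (an assumption of this sketch)
    end: int     # exclusive end position


@dataclass
class Gene:
    regions: list = field(default_factory=list)  # ordered list of Region
    # Stored translation; it need not equal the codon-by-codon translation.
    translation: str = ""


def gene_dna(genome, gene):
    """Concatenate the gene's regions. `genome` maps contig name -> DNA string."""
    return "".join(genome[r.contig][r.start:r.end] for r in gene.regions)


if __name__ == "__main__":
    genome = {"contig1": "CCATGAAACC", "contig2": "GGTGATT"}
    gene = Gene(
        regions=[Region("contig1", 2, 8), Region("contig2", 2, 5)],
        translation="MK",
    )
    print(gene_dna(genome, gene))
```

<p>Keeping the translation as a separate field is what lets the rare selenocysteine and pyrrolysine cases be recorded without special-casing the genetic code.</p>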
321 : overbeek 1.5 <h2>The cell: Adding the Concepts Needed to Discuss
322 :     Transcriptional Regulation</h2>
323 :     In the final version of the abstraction, we add the minimal set of
324 :     notions needed to support
325 :     analysis of transcriptional regulation. An <b>operon</b>
326 :     is a set of contiguous genes that are all on the same strand and are
327 :     all co-regulated. We consider a gene that is not co-regulated with any
328 :     adjacent genes
329 :     to be an operon composed of just itself. A <b>binding site</b>
330 :     is a small region of DNA (normally
331 :     occurring a short space ahead of an operon) that acts as a switch
332 :     turning the operon "on" or "off". When
333 :     a specific protein or expressed RNA called a <b>transcriptional
334 :     regulator</b> binds the site, it flips the switch. One or more
335 :     specific transcriptional regulators can bind a specific site (i.e.,
336 :     sets of sites are associated with each specific transcriptional
337 :     regulator). A regulator binding at a given site
338 :     always has the same effect (either activating or deactivating the
339 :     operon), but which effect depends on
340 : overbeek 1.4 the site-regulator pair.
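<p>The operon definition lends itself to a short sketch. The code below assumes, purely for illustration, that each gene already carries a label saying which genes it is co-regulated with; in practice that label is exactly what one is trying to discover. The function name, the tuple layout, and the site and regulator names are hypothetical.</p>

```python
from itertools import groupby


def operons(genes):
    """Group genes into operons.

    genes: list of (name, strand, regulation_label) tuples in genome order.
    Adjacent genes on the same strand with the same label form one operon;
    a gene with no such neighbour becomes an operon composed of just itself.
    """
    return [
        [name for name, _, _ in run]
        for _, run in groupby(genes, key=lambda gene: (gene[1], gene[2]))
    ]


# The effect of binding depends on the site-regulator pair, so it is keyed on both.
EFFECT = {
    ("site-1", "regulator-A"): "activate",
    ("site-1", "regulator-B"): "deactivate",
}

if __name__ == "__main__":
    genes = [
        ("g1", "+", "r1"), ("g2", "+", "r1"),
        ("g3", "-", "r1"), ("g4", "-", "r2"), ("g5", "+", "r1"),
    ]
    print(operons(genes))
    print(EFFECT[("site-1", "regulator-A")])
```

<p>Note that g5 is not merged with g1 and g2 even though the label matches: an operon must be contiguous.</p>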
341 :     <h1>Part 2: Tutorial Notes</h1>
342 :     <h2>Notes for The Basic Abstraction</h2>
343 : overbeek 1.5 We will be speaking about organisms that are a single cell. At some
344 :     point life began on earth.
345 :     The single-celled organisms that we know of replicate producing copies
346 :     of themselves that have
347 :     genomes which usually have very, very similar content to that of the
348 :     parent cell. <b>Evolution</b> is the
349 :     process in which cells replicate with some alterations in their
350 :     genomes, are subjected to
351 :     <i>selective pressure</i>, and survive or not depending on
352 :     many somewhat random factors. The makeup of
353 :     cells (i.e., the genomes they contain and the machines that define what
354 :     they are capable of doing)
355 : overbeek 1.3 changes gradually (and sometimes not so gradually) as time passes.
356 : overbeek 1.5 <p>The original life forms that existed billions of years ago
357 :     have evolved into three broad categories of
358 :     life forms. That is, the evolutionary process led to early divisions,
359 :     and these led to three main
360 :     categories of single-celled organisms. We call these three forms the <b>archaea</b>,
361 : overbeek 1.3 the <b>bacteria</b>, and the <b>eukaryotes</b>.
362 : overbeek 1.5 A majority of the organisms for which we have acquired complete genomes
363 :     are from the bacteria, although the
364 : overbeek 1.3 numbers are rapidly growing for all three domains.
365 : overbeek 1.5 </p>
366 :     <p>The minimal notion of a cell is enough to explain some of the
367 :     basic
368 : overbeek 1.1 problems in bioinformatics:
369 : overbeek 1.5 </p>
370 : overbeek 1.1 <h3>Identify the genes within a genome</h3>
371 : overbeek 1.3 If we are to understand the contents of genomes, we will need to
372 : overbeek 1.5 locate the genes that occur in each genome. This problem simply
373 :     involves taking a genome (a
374 :     string of DNA) and locating the set of genes it contains. In the case
375 :     of bacteria and archaea, we know pretty well how to
376 :     locate the genes. Once we
377 : overbeek 1.3 have identified instances from many genomes, it becomes possible to
378 :     recognize the genes in a new genome by just looking for things similar
379 : overbeek 1.5 to those we already understand. The following problem is at the heart
380 :     of recognizing when two
381 : overbeek 1.3 genes are "similar".
382 : overbeek 1.5 <h3>Given two genes, "align" them in a way that minimizes some
383 :     edit function. </h3>
384 :     For example, here is what you see when you align two genes from
385 :     distinct organisms:
386 :     <pre>gene1 ATGGCTGATTTATTCGCATTGACCGAAGAAGCGTTGGCGGGCATGGGCATCGAGTTGGTC<br>gene2 ---GTGCAACTGACGGAACTGATAGAAACTACGGTCACGGGGCTCGGCTACGAGCTCGTC<br>* * * * * *** *** ** * **** * *** **** * ***<br>gene1 GATGTCGAACGTGCCGCCTTAGGCTTGTTGCGCGTGACCATAGACCGTGAGGACGGTGTT<br>gene2 GATCTCGAGCGCACCGGGCGCGGCATGGTCTGCGTCTACATCGATCAGCCCGCCGGCATC<br>*** **** ** *** *** ** * **** *** ** * * *** * gene1 CGCATCGAAGATTGTGAGCAGGTGTCCCGGCAATTGTCGCGCGTCTACGAGGTCGAGAAC<br>gene2 ACGATCGACGATTGCGAGAAGGTCACGCGTCAGCTCCAGCACGTACTGACGGTCGAAAAC<br>***** ***** *** **** * ** ** * ** *** ****** ***<br>gene1 ATCGATTACAAACGTCTGGAAGTTGGCTCGCCGGGCGTGGATCGCCCCTTGCGCAACGAG<br>gene2 ATCGATTACGAACGGCTCGAGGTCTCGTCACCGGGGCTCGACCGGCCGTTGAAGAAGCTG<br>********* **** ** ** ** ** ***** * ** ** ** *** ** *<br>gene1 GCGGAATTCCGTCGTTTCGCGGGTGAACGTATCGAGATCAAGCTGCGTGAGGCAGTCGAC<br>gene2 GCTGACTTCACGCGTTTCGCGGGCAGCGAGGCCGTCATCACCCTGAAAAAGCCGTTGGAC<br>** ** *** *********** ** **** *** ** * * ***<br>gene1 GGGCGCAAAGTGTTTACCGGCATCCTGCAAGAGGCGGACACGTCTGCTGACGATAAGACG<br>gene2 GGGCGCAAGACGTACCGGGGCATTCTGCACGCGCCGAAC------------GGCGAGACG<br>******** ** ***** ***** * * ** ** * *****<br>gene1 GTGTTCGGTCTCGAATTTGAGGCAAAGAAGGACGATATTCAGGTACTGAGCTTCACGCTC<br>gene2 AT---CGGTTTGGAATTTGAGAGGAAGAAGGGCGAGGCGGCCATGCTGGATTTCACGCTG<br>* **** * ********* ******* *** * *** ******** gene1 GATGACATCGAGCGCGCCAAGCTGGATCCCGTTCTGGATTTCAAGGGCAAAAAGCGATGA<br>gene2 GCGGACATCGACAAGGCCCGCCTGATTCCGCACGTTGACTTTAGGAGCCGCAAACAATGA<br>* ******** *** *** *** * ** ** * * ** ** * ****<br></pre>
387 : overbeek 1.3 <hr>
388 : overbeek 1.5 The sequences are recognizably similar, and in fact implement exactly
389 :     the same function
390 :     in the two cells. If we align the protein sequences corresponding to
391 :     these two
392 : overbeek 1.3 genes, we get
393 : overbeek 1.5 <pre>gene1 MADLFALTEEALAGMGIELVDVERAALGLLRVTIDREDGVRIEDCEQVSRQLSRVYEVEN<br>gene2 -VQLTELIETTVTGLGYELVDLERTGRGMVCVYIDQPAGITIDDCEKVTRQLQHVLTVEN<br> :* * * :::*:* ****:**:. *:: * **: *: *:***:*:***.:* ***<br><br>gene1 IDYKRLEVGSPGVDRPLRNEAEFRRFAGERIEIKLREAVDGRKVFTGILQEADTSADDKT<br>gene2 IDYERLEVSSPGLDRPLKKLADFTRFAGSEAVITLKKPLDGRKTYRGILHAPNG-----E<br> ***:****.***:****:: *:* ****.. *.*::.:****.: ***: .:<br><br>gene1 VFGLEFEAKKDDIQVLSFTLDDIERAKLDPVLDFKGKKR<br>gene2 TIGLEFERKKGEAAMLDFTLADIDKARLIPHVDFRSRKQ<br> .:***** **.: :*.*** **::*:* * :**:.:*:<br></pre>
394 :     There is a great deal of work relating to recognizing when two
395 :     sequences are
396 :     similar and whether or not they had a common ancestor. Understanding
397 :     why
398 :     selective pressure conserves sections of sequences, but not others,
399 :     will yield
400 :     important clues. Can you reason out why some sections might be
401 :     conserved, while
402 : overbeek 1.3 others vary wildly?
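<p>For readers who want to see the pairwise problem stated as an algorithm, here is a minimal dynamic-programming sketch in the style of Needleman and Wunsch. The "edit function" is assumed to charge 1 per mismatch and 1 per gap character; real alignment tools use much richer scoring (especially for proteins), so this shows the shape of the computation and is not how the alignments above were produced.</p>

```python
def align(a, b, mismatch=1, gap=1):
    """Globally align strings a and b, minimizing the total edit cost.

    Returns (cost, aligned_a, aligned_b), where gaps are written as "-".
    """
    n, m = len(a), len(b)
    # cost[i][j] is the minimal cost of aligning a[:i] with b[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = cost[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            cost[i][j] = min(substitute, cost[i - 1][j] + gap, cost[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        step = 0 if i == 0 or j == 0 or a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + step:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return cost[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


if __name__ == "__main__":
    total, top, bottom = align("ATGGCTGATTTA", "GTGCAACTGACG")
    print(total)
    print(top)
    print(bottom)
```

<p>The table has (n+1) by (m+1) entries, so time and memory grow with the product of the two lengths, which is one reason large-scale similarity search relies on faster heuristics.</p>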
403 : overbeek 1.5 <p>Comparing sets of sequences that have retained the same
404 :     function is
405 :     at the heart of understanding cellular machines and the proteins that
406 :     implement them. We find that looking at sets (often with more than two
407 :     sequences) and aligning them
408 : overbeek 1.3 is important.
409 : overbeek 1.5 </p>
410 :     <h3> Given a set of sequences, align them in a way that minimizes
411 :     some edit function.</h3>
412 : overbeek 1.1 Here is an example of a multiple sequence alignment:
413 :     <br>
414 :     <br>
415 : overbeek 1.5 <pre>CLUSTAL W (1.83) multiple sequence alignment<br><br><br>seq3 -------------------MRYISTRGQAPALNFEDVLLAGLASDGGLYVPENLPRFTLE<br>seq4 -------------------MRYISTRGSAPTLSFEEVLLTGLASDGGLYVPESLPSFTSA<br>seq5 -------------------MNYISTRGAIAPIGFKDAVMMGLATDGGLLLPETIPALGRN<br>seq1 -------------------MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLT<br>seq2 MKIRVICGAPTPKPFIKIPMKYYSTNKQAPLASLEEAVVKGLASDKGLFMPMTIKPLPQE<br> *. . . .: :.: **..: ** .* : :<br><br>seq3 EIASWVGLPYHELAFRVMRPFVAGSIADADFKKILEETYGVFAHDAVAPLRQLNGNEWVL<br>seq4 ELEAMASLDYPSLAHRILLPFVEEAFTGEELREIIDDTYAVFRHSAVAPLVQLDHNQWVL<br>seq5 TLESWQSLSYQDLAFNVIS-LFADDIPAQDLKDLIDRSYATFSHPEITPVVEKDG-VYIL<br>seq1 EIDEMLKLDFVTRSAKILSAFIGDEIPQEILEERVRAAFAFP-----APVANVESDVGCL<br>seq2 FYDEIENLSFREIAYRVADAFFGEDVPAETLKEIVYDTLNFD-----VPLVPVKENIYSL<br> * : : .: :. .. :.. : : .*: . *<br><br>seq3 ELFHGPTLAFKDFALQLLGRLLDHVLAKRGER-VVIMGATSGDTGSAAIEGCRRCDNVDI<br>seq4 ELFHGPTLAFKDFALQLLGRLLDAILKRRGEK-VVIMGATSGDTGSAAIAGCERCENIDI<br>seq5 ELFHGPTLAFKDVALQLLGNLFEYLLKERGEK-MNIVGATSGDTGSAAIYGVRGKDKINI<br>seq1 ELFHGPTLAFKDFGGRFMAQMLTHIA---GDKPVTILTATSGDTGAAVAHAFYGLPNVKV<br>seq2 ELFHGPTLAFKDVGGRFMARLLGYFIRKEGRKQVNVLVATSGDTGSAVANGFLGVEGIHV<br> ************.. :::..:: . * : : :: *******:*. . 
:.:<br><br>seq3 FIMHPHNRVSEVQRRQMTTILGDNIHNIAIEGNFDDCQEMVKASFADQGFLK-GTRLVAV<br>seq4 FILHPHGRVSEVQRRQMTTLSAPTIHNLAIEGNFDDCQAMVKASFRDQSFLPDGRRLVAV<br>seq5 FILHPHGKTSPVQALQMTTVLDPNVHNIAARGTFDDCQNIVKSLFSDLPFKE-KYSLGAV<br>seq1 VILYPRGKISPLQEKLFCTLGG-NIETVAIDGDFDACQALVKQAFDDEELKV-ALGLNSA<br>seq2 YVLYPKGKVSEIQEKQFTTLGR-NITALEVDGTFDDCQALVKAAFMDQELNE-QLLLTSA<br> :::*:.: * :* : *: .: : * ** ** :** * * : * :.<br><br>seq3 NSINWARIMAQIVYYFHAALQLG-APH-RSVAFSVPTGNFGDIFAGYLARNMGLPVSQLI<br>seq4 NSINWARIMAQIVYYFYAGLRLG-APH-RAAAYSVPTGNFGDIFAGYLASKMGLPVAQLM<br>seq5 NSINWARVLAQVVYYFYAYFRVA-ALFGQEVVFSVPTGNFGDIFAGYVAKRMGLPIRRLI<br>seq1 NSINISRLLAQICYYFEAVAQLPQETRNQ-LVVSVPSGNFGDLTAGLLAKSLGLPVKRFI<br>seq2 NSINVARFLPQAFYYFYAYAQLKKAGRAENVVICVPSGNFGNITAGLFGKKMGLPVRRFI<br> **** :*.:.* *** * :: . . .**:****:: ** .. :***: :::<br><br>seq3 VATNRNDILHRFMSGNRYDKDTLHPSLSPSMDIMVSSNFERLLFDLHGRNGKAVAELLDA<br>seq4 IATNRNDVLHRLLSTGDYARQTLEHTLSPSMDISVSSNFERLMFDLYERDGAAIASLMAA<br>seq5 LATNENNILSRFINGGDYSLGDVVATVSPSMDIQLASNFERYVYYLFGENPARVREAFAA<br>seq1 AATNVNDTVPRFLHDGQWSPKATQATLSNAMDVSQPNNWPR-VEELFR------------<br>seq2 AANNKNDIFYQYLQTGQYNPRPSVATIANAMDVGDPSNFAR-VLDLYGGS----------<br> *.* *: . : : . : ::: :**: ..*: * : *.<br><br>seq3 FKASGKLSVEDQRWTEARKLFDSLAVSDEQTCETIAEVYRSCGELLDPHTAIGVRAAREC<br>seq4 FDD-GDITLSDAAMEKARQLFASHRVDDAQTLACIADVWGRTEYLLDPHSAIGYAAATQP<br>seq5 LPTKGRIDFTEAEMEKVRDEFLSRSVNEDETIATIAAFHRETGYILDPHTAVGVKAALEL<br>seq1 -------------RKIWQLKELGYAAVDDETTQQTMRELKELGYTSEPHAAVAYRALRDQ<br>seq2 -------------HAAIAAEISGTTYTDEQIRESVKACWQQTGYLLDPHGACGYRALEEG<br> . : : :** * . 
* :<br><br>seq3 RRSLSVPMVTLGTAHPVKFPEAVEKAGIGQAPALPAHLADLFEREERCTVLPNELAKVQA<br>seq4 GANTQTPWVTLATAHPAKFPDAIKASAVGTTAQLPVHLADLFERSEHFDVLPNDIAAVQR<br>seq5 VQDG-TPAVCLATAHPAKFAEAVVR-AVGFEPSRPTSLEGIEALPSRCDVLDADRDAIKA<br>seq1 LNPG-EYGLFLGTAHPAKFKESVEA-ILGETLDLPKELAERADLPLLSHNLPADFAALRK<br>seq2 LQPG-ETGVFLETAHPAKFLQTVES-IIGTEVEIPAKLRAFMKGEKKSLPMTKEFADFKS<br> : * ****.** ::: :* * * : : .:<br><br>seq3 FVSQHGNRGKPL<br>seq4 FMSGHLGA----<br>seq5 FIEKKAL-----<br>seq1 LMMNHQ------<br>seq2 YLLGK-------<br> : :<br></pre>
416 :     <h3> Given a multiple sequence alignment, determine the most
417 :     likely evolutionary history of the sequences (i.e., construct a
418 :     phylogenetic tree).</h3>
419 :     From the extant five sequences that are similar and displayed in the
420 :     previous alignment, we can construct
421 : overbeek 1.3 a tree that depicts the "phylogenetic history" of the sequences.
422 : overbeek 1.5 Here is one reasonable tree for the last 5 sequences.
423 : overbeek 1.1 <pre>
424 : overbeek 1.5 ,--------------------------------------------------- seq1
425 :     |
426 :     |
427 :     ,------------------|
428 : overbeek 1.1 | |
429 :     | |
430 : overbeek 1.5 | `---------------------------------------------- seq2
431 :     |
432 :     |
433 :     |
434 :     |
435 :     |
436 :     | ,-------------------------------- seq3
437 :     | |
438 :     | |
439 :     |-------------|
440 :     | |
441 :     | |
442 :     | `------------------------------ seq4
443 : overbeek 1.1 |
444 :     |
445 :     `---------------------------------------------- seq5
446 :     </pre>
447 : overbeek 1.3 The tree suggests that at some point an ancestral
448 : overbeek 1.5 cell replicated. One copy led (through a chain of descendants) to <b>seq5</b>,
449 :     while the remaining sequences descend from the other copy.
450 :     <p>Note that we now have alignments that
451 :     contain thousands of sequences, and even displaying such trees is
452 :     nontrivial.
453 :     Because evolution plays such a central role in the phenomena we study,
454 :     the construction of alignments
455 :     and trees in order to compare extant versions of proteins and gain
456 :     insight into their historical origins
457 : overbeek 1.3 is considered basic to the task at hand.
458 : overbeek 1.5 </p>
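<p>As one simple illustration of building a tree from a multiple sequence alignment, the sketch below uses UPGMA (average-linkage clustering) on the fraction of differing columns. This is a deliberately simple distance method chosen for brevity; it is not claimed to be the method used to draw the tree above, and it does not attempt to find the "most likely" history. Function names and the toy sequences are our own. The distance function assumes every pair of rows shares at least one ungapped column.</p>

```python
from itertools import combinations


def p_distance(s, t):
    """Fraction of columns (ignoring any column with a gap) where s and t differ."""
    pairs = [(x, y) for x, y in zip(s, t) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def upgma(aligned):
    """Cluster aligned rows {name: row} by UPGMA; return the tree as nested tuples."""
    # Each cluster is stored as key -> (subtree, list of member sequence names).
    clusters = {name: (name, [name]) for name in aligned}
    dist = {
        frozenset((a, b)): p_distance(aligned[a], aligned[b])
        for a, b in combinations(aligned, 2)
    }

    def cluster_distance(x, y):
        # Average distance over all pairs of members (average linkage).
        members_x, members_y = clusters[x][1], clusters[y][1]
        total = sum(dist[frozenset((a, b))] for a in members_x for b in members_y)
        return total / (len(members_x) * len(members_y))

    while len(clusters) > 1:
        # Merge the two closest clusters.
        x, y = min(combinations(clusters, 2), key=lambda pair: cluster_distance(*pair))
        tree_x, members_x = clusters.pop(x)
        tree_y, members_y = clusters.pop(y)
        clusters[x + "+" + y] = ((tree_x, tree_y), members_x + members_y)

    (tree, _members), = clusters.values()
    return tree


if __name__ == "__main__":
    toy = {"a": "AAAAAAAA", "b": "AAAAAAAT", "c": "TTTTAAAA", "d": "TTTTAAAT"}
    print(upgma(toy))
```

<p>UPGMA implicitly assumes that all lineages change at the same rate, which is often false; that is one reason more careful tree-building methods exist.</p>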
459 :     <h3>Some Random Facts that You Should Absorb</h3>
460 :     Most genomes of bacteria contain between 400,000 and 12,000,000
461 :     characters. Normally, the genes in a genome
462 : overbeek 1.1 cover about 90% of the genome.
463 : overbeek 1.5 Normally, there is about one gene per 1000 characters in a bacterial
464 :     genome.
465 :     <p>So, </p>
466 : overbeek 1.1 <ul>
467 : overbeek 1.5 <li> What is the length of the average protein sequence? </li>
468 : overbeek 1.1 <li>How many genes do these
469 : overbeek 1.5 genomes have? </li>
470 : overbeek 1.1 <li>What is the average length of a gene?
471 : overbeek 1.5 </li>
472 : overbeek 1.1 </ul>
473 :     <br>
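<p>If you want to check your answers, the arithmetic can be worked through using only the three rough figures quoted above (genome size, 90% coverage, one gene per 1000 characters). The sketch below is back-of-the-envelope only; the constant and function names are ours, and the results are averages implied by those figures, not measurements.</p>

```python
GENOME_SIZE_RANGE = (400_000, 12_000_000)  # characters, from the text above
CHARACTERS_PER_GENE = 1000                 # about one gene per 1000 characters
CODING_PERCENT = 90                        # genes cover about 90% of the genome


def gene_count_range():
    """Approximate number of genes in the smallest and largest genomes."""
    return tuple(size // CHARACTERS_PER_GENE for size in GENOME_SIZE_RANGE)


def average_gene_length():
    """Each 1000-character stretch holds one gene covering about 90% of it."""
    return CHARACTERS_PER_GENE * CODING_PERCENT // 100


def average_protein_length():
    """Three characters per codon; the final stop codon adds no amino acid."""
    return average_gene_length() // 3 - 1


if __name__ == "__main__":
    print(gene_count_range())
    print(average_gene_length())
    print(average_protein_length())
```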
474 : overbeek 1.5 It is worth spending just a short bit of time thinking about what types
475 :     of
476 :     machines must exist in each cell. Here are a few thoughts to start with
477 : overbeek 1.1 <ul>
478 :     <li>
479 : overbeek 1.5 There must be one or more machines that support replication of the
480 :     cell. You would
481 :     need something to copy the genome, and you would need something that
482 :     could build the DNA
483 :     bases that represent the characters (i.e., you will need machines to
484 :     build the molecules
485 :     corresponding to each of the four characters in the alphabet of DNA
486 :     bases).
487 :     </li>
488 :     <li>As we mentioned, you have transport machines that take
489 :     things into and out of the cell. Many
490 :     cells can import food in the form of sugar molecules. For example, many
491 :     cells can import
492 :     <i>glucose</i>, a six-carbon compound. As the compound
493 :     gets broken down into smaller compounds,
494 :     energy is salvaged from the broken bonds to power the machines in the
495 :     cell. The smaller compounds
496 : overbeek 1.1 are used as building blocks for other needs.
497 : overbeek 1.5 </li>
498 :     <li>There must be one or more machines involved in building
499 :     proteins from the descriptions in the genes.
500 :     In particular, we will need a machine for each of the amino acids
501 :     (unless the cell can import some
502 : overbeek 1.1 of them).
503 : overbeek 1.5 </li>
504 :     <li>There must be mechanisms for sensing what is going on in
505 :     the environment and allowing the cell
506 :     to react to it. For example, many cells can "swim" towards food.
507 :     </li>
508 : overbeek 1.1 </ul>
509 : overbeek 1.5 Those were just a few examples. For any cell, we have many, many
510 :     machines, and we still
511 :     do not even understand what some of them do. Later, we will try to
512 :     offer a more structured
513 : overbeek 1.3 estimate of what is already known.
514 : overbeek 1.5 <p>About 50-60% of the genes occur within 5000 characters of
515 :     another gene such that
516 :     the two genes encode proteins that are part of the same cellular
517 :     machine. This fact suggests that just having a large number of genomes
518 :     would enable a person to group
519 :     the genes into the machines they implement, without the person
520 :     understanding the functions
521 : overbeek 1.3 of the machines or the roles played by each protein.
522 : overbeek 1.5 </p>
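<p>The suggestion in the preceding paragraph can be sketched as a counting exercise: find pairs of genes that sit within 5000 characters of each other, and keep the pairs that recur in several genomes. The sketch assumes that corresponding genes in different genomes have already been given a shared label (here called a "family"), that each genome is a single contig, and that distance is measured between gene start positions. All names and the toy data are hypothetical.</p>

```python
from collections import Counter
from itertools import combinations


def close_family_pairs(genes, max_distance=5000):
    """Pairs of families whose genes start within max_distance of each other.

    genes: list of (family, start_position) for one genome (one contig).
    """
    pairs = set()
    for (fam_a, pos_a), (fam_b, pos_b) in combinations(genes, 2):
        if fam_a != fam_b and abs(pos_a - pos_b) <= max_distance:
            pairs.add(frozenset((fam_a, fam_b)))
    return pairs


def conserved_neighbors(genomes, min_genomes=2):
    """Family pairs found close together in at least min_genomes genomes."""
    counts = Counter()
    for genes in genomes.values():
        counts.update(close_family_pairs(genes))
    return {pair: n for pair, n in counts.items() if n >= min_genomes}


if __name__ == "__main__":
    genomes = {
        "genome-1": [("famA", 100), ("famB", 1500), ("famC", 90000)],
        "genome-2": [("famB", 40000), ("famA", 42000), ("famC", 500)],
        "genome-3": [("famA", 10), ("famC", 3000)],
    }
    print(conserved_neighbors(genomes))
```

<p>A pair that stays close in many otherwise rearranged genomes is a candidate for membership in the same machine, even when nothing is known about what either protein does.</p>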
523 :     <p>Occasionally, proteins that are usually distinct in most cells
524 :     are fused into a single protein in
525 :     a few cells. In these cases, the fused gene is (by definition) part of
526 :     a single machine, and
527 :     in most cells in which the proteins are not fused, the two distinct
528 :     proteins are separate components
529 :     of a single machine. This, too, offers clues to support analysis of
530 :     which proteins go with which machines.
531 :     </p>
532 :     <p>Biologists have figured out the roles of about 50% of the
533 :     genes. That is, they can
534 :     place the gene in a cellular machine, they know what the machine does,
535 :     and they know
536 :     the specific role of the gene in sustaining the functionality of the
537 :     machine.
538 :     </p>
540 :     <h3>Imposing a Structure on Characterizing the Inventory</h3>
541 :     <p>One central goal of bioinformatics is to support an accurate
542 :     characterization of the cellular
543 :     machinery for each cell. It is of major importance to biologists that we
544 :     be able to support
545 :     comparative analysis of cells. Perhaps the most important aspect of
546 :     understanding cells relates to
547 :     their origin in an evolutionary process. Cells have a long evolutionary
548 :     history dating back billions of
549 :     years. The machines we see in cells today arose in the past, so we
550 :     expect to see many current cells
551 :     using machinery that resembles what turns up in other cells. When we
552 :     compare machines from different
553 :     cells they often look remarkably similar. On the other hand, those that
554 :     had a common origin in a cell that existed billions of years in the
555 :     past may now have versions that are not very similar. Modifications,
556 :     optimizations,
557 :     and insignificant alterations all combine to explore the space of
558 :     operational possibilities for
559 :     each type of machine. Hence, we need a framework for studying
560 :     similarities and differences in the
561 : overbeek 1.1 cellular machines and the proteins that implement them.
562 : overbeek 1.5 </p>
563 :     <p>Here is a short formulation of one way to do this:
564 :     <br>
565 :     <br>
566 :     </p>
567 : overbeek 1.1 <ul>
568 : overbeek 1.5 <li>A <b>subsystem</b> (i.e., an abstract cellular
569 :     machine) is a set of functional roles.
570 :     </li>
571 :     <li>Each protein implements one or more functional roles. The
572 :     set of functional roles
573 :     implemented by the protein is called the <b>function of the
574 :     protein</b>. The function of a multifunctional
575 :     protein that implements {functional-role-1,functional-role-2} is
576 :     normally written as
577 : overbeek 1.1 <i>functional-role-1 / functional-role-2</i>.
578 : overbeek 1.5 <br>
579 :     <br>
580 :     </li>
581 :     <li>A <b>populated subsystem</b> is a subsystem
582 :     with an attached spreadsheet. Each column
583 :     in the spreadsheet corresponds to a functional role in the subsystem,
584 :     and each row corresponds to
585 :     a specific genome. Each cell in the spreadsheet contains the genes from
586 :     the corresponding genome
587 :     that implement the designated functional role (there may be 0 or more
588 :     such genes).
589 :     </li>
590 :     </ul>
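<p>The three definitions above can be sketched as a small data structure. This is only an illustration under our own assumptions: the class name <i>PopulatedSubsystem</i>, its methods, and the helper <i>function_of_protein</i> are hypothetical names introduced here, and a real implementation would sit on top of a database.</p>

```python
from dataclasses import dataclass, field


@dataclass
class PopulatedSubsystem:
    """A subsystem (a set of functional roles) with its attached spreadsheet."""
    name: str
    roles: list[str]  # one spreadsheet column per functional role
    # One row per genome: genome ID -> {functional role -> genes implementing it}
    rows: dict[str, dict[str, set[str]]] = field(default_factory=dict)

    def add_gene(self, genome: str, role: str, gene: str) -> None:
        """Record that a gene from a genome implements a functional role."""
        if role not in self.roles:
            raise ValueError(f"{role!r} is not a role of subsystem {self.name!r}")
        self.rows.setdefault(genome, {}).setdefault(role, set()).add(gene)

    def cell(self, genome: str, role: str) -> set[str]:
        """Genes in one spreadsheet cell; the set may be empty (0 or more genes)."""
        return self.rows.get(genome, {}).get(role, set())


def function_of_protein(roles: list[str]) -> str:
    """Write the function of a (possibly multifunctional) protein as 'role-1 / role-2'."""
    return " / ".join(roles)
```

<p>An empty cell is returned as an empty set rather than an error, which matches the remark that a genome may contribute no genes for a given functional role.</p>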
591 :     <br>
592 :     <br>
593 :     We do not actually know what machines are present in a cell. We are in
594 :     the midst of a grand
595 :     effort to clarify which are there and what they do. The formulation of
596 :     subsystems as abstract machines,
597 :     in which each row of the populated subsystem describes a specific cellular
598 :     machine that is believed to be present,
599 : overbeek 1.1 represents a way to maintain a collection of estimates or assertions.
600 : overbeek 1.5 <p>A <b>protein family</b> is defined to be a set of
601 :     proteins that implement the same functional roles and
602 : overbeek 1.1 are similar over the entire lengths of the proteins.
603 : overbeek 1.5 </p>
604 :     <p>We seek a situation in which each protein occurs in one or
605 :     more subsystems and in a single protein family.
606 : overbeek 1.1 The computational tasks imposed by such a goal are obvious:
607 : overbeek 1.5 </p>
608 : overbeek 1.1 <ul>
609 : overbeek 1.5 <li>We need to construct databases that implement at least the
610 :     following entities:
611 : overbeek 1.1 <ol>
612 : overbeek 1.5 <li>cells (i.e., each cell must have an ID and a set of
613 :     attributes),
614 :     </li>
615 : overbeek 1.1 <li>genomes,
616 : overbeek 1.5 </li>
617 : overbeek 1.1 <li>genes,
618 : overbeek 1.5 </li>
619 : overbeek 1.1 <li>proteins,
620 : overbeek 1.5 </li>
621 : overbeek 1.1 <li>functional roles,
622 : overbeek 1.5 </li>
623 : overbeek 1.1 <li>subsystems, and
624 : overbeek 1.5 </li>
625 : overbeek 1.1 <li>protein families.
626 : overbeek 1.5 </li>
627 : overbeek 1.1 </ol>
628 : overbeek 1.5 </li>
629 :     <li> We need to add support for developing clues to function by
630 :     integrating data
631 : overbeek 1.1 from sources like proximity within the genome, fusions, etc.
632 : overbeek 1.5 </li>
633 :     <li>We need to support a framework for the development of
634 :     populated subsystems.
635 :     </li>
636 :     <li>We need to construct decision procedures for membership in
637 :     protein families. Some of these procedures will be quite complex,
638 :     although the majority of cases can be
639 : overbeek 1.1 handled by fairly general procedures.
640 : overbeek 1.5 </li>
641 : overbeek 1.1 </ul>
642 : overbeek 1.5 <h3>States of the Cell</h3>
643 :     The notion of <i>subsystem</i> was introduced as an <i>abstract
644 :     machine</i> -- that is, as an
645 :     attempt to create a framework for understanding variations within
646 :     specific cellular machines via
647 :     a form of comparative analysis. In any specific cell, sets of specific
648 :     cellular machines are switched on and off as units. That is, they are <i>co-regulated</i>.
649 :     We will call such a set
650 :     of <i>co-regulated cellular machines</i> a <b>regulon</b>
651 :     (note that a regulon is often a set containing
652 :     a single cellular machine). A <b>state</b> of a cell will
653 :     be defined
654 :     as the set of regulons that are operational at a point in time. Thus, a
655 :     state amounts to the set
656 : overbeek 1.2 of cellular machines that are operational at one instant.
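<p>As a minimal sketch of these definitions, a regulon can be represented as a set of cellular machines and a state as a set of regulon names. The function name <i>operational_machines</i> and the dictionary layout are assumptions made for illustration only.</p>

```python
def operational_machines(state, regulons):
    """Return the cellular machines that are operational in a state.

    state: set of names of the regulons that are switched on.
    regulons: regulon name -> set of co-regulated cellular machines.
    """
    machines = set()
    for name in state:
        machines |= regulons[name]
    return machines
```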
657 : overbeek 1.5 <p>If we think of a car as a bag of machines that interact to
658 :     make it function, we might consider there
659 :     to be a huge number of states. There are many very minor "machines"
660 :     like the arm rest (or the radio, or the night light) that can be on or
661 :     off. However, we can divide the states of a car into major groupings
662 :     based on the status
663 :     of some key "machines". For example, "off" (the state in which the
664 :     engine is turned off and the car is parked) and
665 :     "on" (the engine is running and the car is moving) might be viewed as a
666 :     crude partitioning of the states into
667 : overbeek 1.2 two "major states".
668 : overbeek 1.5 </p>
669 :     <p>Similarly, I believe that we should think about <i>major
670 :     states of the cell</i> as being determined by the functioning (or
671 :     not) of a limited set of regulons. The determination of these regulons,
672 :     the major states,
673 :     and how transitions between them are managed are all now parts of the
674 :     picture being filled in.
675 :     </p>
676 :     <h3>Microarrays</h3>
677 :     For our purposes, a microarray is, for a given genome, a pair of lists of genes that "changed
678 :     expression levels" between two states of a
679 :     cell. Basically, the first list contains genes that were "active" during
680 :     the first state, but not the second; and the
681 :     second list contains genes that were "active" in the second but not the
682 :     first. If a cellular
683 :     machine utilizes protein <i>X</i>, and <i>X</i>
684 :     is in the first list, and if <i>X</i> is used in
685 :     only one cellular machine, then it would be reasonable to infer
686 :     that the machine was
687 :     active in the first state, but not the second. If one knew the regulons
688 :     for a specific cell, it would go
689 :     a long way to support extraction of insights from these microarrays. On
690 :     the other hand, if one had many,
691 :     many microarrays, and if the specific cellular machines for the cell
692 :     were known, then one could make
693 :     substantial progress in uncovering the exact composition of the
694 :     regulons that make up the cell.<br>
695 :     <br>
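<p>The inference described above (a gene that is used in only one cellular machine tells us when that machine was active) can be sketched as follows. The function name <i>infer_machine_activity</i>, the labels it returns, and the treatment of contradictory evidence as a "conflict" are our own assumptions for illustration.</p>

```python
def infer_machine_activity(machines, first_only, second_only):
    """Infer in which state each machine was active.

    machines: machine name -> set of genes it uses.
    first_only: genes active in the first state but not the second.
    second_only: genes active in the second state but not the first.
    Returns machine name -> "first", "second", or "conflict"; machines
    without informative genes are omitted.
    """
    # Count how many machines use each gene.
    usage = {}
    for genes in machines.values():
        for gene in genes:
            usage[gene] = usage.get(gene, 0) + 1

    calls = {}
    for name, genes in machines.items():
        # Only genes used by exactly one machine are informative.
        unique = {g for g in genes if usage[g] == 1}
        in_first = bool(unique & first_only)
        in_second = bool(unique & second_only)
        if in_first and in_second:
            calls[name] = "conflict"
        elif in_first:
            calls[name] = "first"
        elif in_second:
            calls[name] = "second"
    return calls
```

<p>Genes shared between machines are deliberately ignored here, since a shared gene changing expression does not say which of its machines changed.</p>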
696 :     We are just now reaching the point where we do, in fact, have hundreds
697 :     of microarrays (each representing changes between two sampled states of
698 :     the cell). &nbsp;<br>
699 :     Let us reflect on how one might use this data to uncover the regulons
700 :     that are represented and how they relate to the major "states of the
701 :     cell".<br>
702 :     <br>
703 :     We might begin by trying to determine sets of genes from each subsystem
704 :     that appear to "move together". &nbsp; More precisely, we want to arrive
705 :     at a set of genes that perform a well-defined function, some subset of
706 :     which almost always shows up in the microarrays as "moving together".
707 :     &nbsp;Of these, if we have genes that occur only in a single
708 :     subsystem, then it would be reasonable to think of these as <span style="font-style: italic;">signatures</span> for the set
709 :     of genes. &nbsp;The most natural way to do this would be to start
710 :     with metabolic subsystems, or even better <span style="font-style: italic;">scenarios</span> (discussed
711 :     below), which are subsets of functional roles from a metabolic subsystem
712 :     such that the subset is a connected set with well-defined inputs and
713 :     outputs. &nbsp;We wish then to define discovery of the regulon sets
714 :     associated with each condition as follows:<br>
715 :     <br>
716 :     <ol>
717 :     <li>&nbsp;First, for each scenario define&nbsp;</li>
718 :     <ul>
719 :     <li>the set of genes that are expected to show up in a
720 :     microarray when the scenario is activated or deactivated (call this
721 :     "the set of genes that move together" = <span style="font-style: italic;">SGMT for the scenario),</span></li>
722 :     <br>
723 :     <li>the subset of genes (perhaps empty) of the SGMT that are <span style="font-style: italic;">signatures</span> (call
724 :     this <span style="font-style: italic;">signatures of the
725 :     scenario)</span></li>
726 :     </ul>
727 :     <br>
728 :     <li>Then define the <span style="font-style: italic;">set
729 :     of regulons</span>. &nbsp;Each regulon is &nbsp;a set of
730 :     scenarios. &nbsp;There is a cost <span style="font-weight: bold;">cost_reg</span> associated
731 :     with the definition of each regulon. &nbsp;This prevents the
732 :     definition of numerous regulons all containing just one scenario.
733 :     &nbsp;If the penalty is set too high, only one regulon will be
734 :     defined. &nbsp;If it is set too low, then a large set of small
735 :     regulons results.</li>
736 :     <br>
737 :     <li>Finally, you need to define the set of regulons that were
738 :     activated for each microarray and the set that were deactivated.</li>
739 :     <br>
740 :     <li>Now, you compute a score for your decisions as&nbsp;<span style="font-weight: bold;">score = P - M - (cost_reg *
741 :     number_of_defined_regulons * number_of_microarrays)</span> where</li>
742 :     <br>
743 :     <ul>
744 :     <li><span style="font-weight: bold;">P</span>
745 :     = <span style="font-weight: bold;">p1 + p2,</span>
746 :     where&nbsp;<span style="font-weight: bold;"></span></li>
747 :     <br>
748 :     <ul>
749 :     <li><span style="font-weight: bold;">p1</span>
750 :     = <span style="font-weight: bold;">a1 * value_signature </span>and
751 :     <span style="font-weight: bold;">a1</span>
752 :     is the number of signatures of scenarios that moved as predicted, and <span style="font-weight: bold;">value_signature </span>is
753 :     the value associated with a signature moving in the direction predicted,</li>
754 :     <br>
755 :     <li><span style="font-weight: bold;">p2 = a2 *
756 :     value_SGMT_nonsig</span> and <span style="font-weight: bold;">a2</span>
757 :     is the number of non-signature SGMT genes that moved as predicted, and <span style="font-weight: bold;">value_SGMT_nonsig</span> is
758 :     the value associated with a non-signature SGMT gene moving in the
759 :     direction predicted, and</li>
760 :     <br>
761 :     </ul>
762 :     <li><span style="font-weight: bold;">M = m1 +
763 :     m2, where</span></li>
764 :     <br>
765 :     <ul>
766 :     <li><span style="font-weight: bold;">m1 = b1 *
767 :     value_signature</span> and <span style="font-weight: bold;">b1</span>
768 :     is the number of signatures of scenarios that did not move as
769 :     predicted, &nbsp;and</li>
770 :     <br>
771 :     <li><span style="font-weight: bold;">m2 = b2 *
772 :     value_SGMT_nonsig </span>and <span style="font-weight: bold;">b2</span>
773 :     is the number of non-signature SGMT genes that did not move as predicted.&nbsp;</li>
774 :     </ul>
775 :     <br>
776 :     The&nbsp;<span style="font-weight: bold;">score </span>reflects
777 :     how well your decisions in the first three steps match the data in the
778 :     microarrays. &nbsp;The object is to make the sets of decisions in
779 :     the first three steps in a way that maximizes the&nbsp;<span style="font-weight: bold;">score.<br>
780 :     <br>
781 :     </span><span style="font-weight: bold;"></span><br>
782 :     <span style="font-weight: bold;"></span><span style="font-weight: bold;"></span>
783 :     </ul>
784 :     </ol>
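<p>The scoring rule above can be written out directly. The sketch below is an illustration, not a tuned implementation: the function names and the encoding of predictions as gene-to-direction dictionaries are assumptions, and the weights <b>value_signature</b>, <b>value_SGMT_nonsig</b>, and <b>cost_reg</b> must be chosen by the user.</p>

```python
def count_hits(predicted, observed):
    """Count genes that did and did not move as predicted.

    predicted, observed: gene -> direction (for example "up" or "down").
    Returns (moved_as_predicted, did_not_move_as_predicted).
    """
    hits = sum(1 for gene, direction in predicted.items()
               if observed.get(gene) == direction)
    return hits, len(predicted) - hits


def score(a1, a2, b1, b2, value_signature, value_sgmt_nonsig,
          cost_reg, number_of_defined_regulons, number_of_microarrays):
    """score = P - M - (cost_reg * number_of_defined_regulons * number_of_microarrays)."""
    p = a1 * value_signature + a2 * value_sgmt_nonsig  # P = p1 + p2
    m = b1 * value_signature + b2 * value_sgmt_nonsig  # M = m1 + m2
    penalty = cost_reg * number_of_defined_regulons * number_of_microarrays
    return p - m - penalty
```

<p>For example, with 3 signatures and 5 non-signature genes moving as predicted, 1 signature and 2 non-signature genes failing to, weights of 2.0 and 1.0, and a cost of 0.5 for 2 regulons over 4 microarrays, we get P = 11, M = 4, a penalty of 4, and a score of 3.</p>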
785 : overbeek 1.2
786 : overbeek 1.1
787 : overbeek 1.4 <h2>Notes for the Enhanced Abstraction</h2>
788 : overbeek 1.5 The process of <b>expressing a gene</b> amounts to using
789 :     the gene to produce the functional component of
790 :     a machine (a protein for a protein-encoding gene, and an RNA for an
791 :     RNA-encoding gene).
792 :     The process of expressing a protein-encoding gene, which takes a gene (a
793 :     string of DNA formed by concatenating a sequence of
794 :     regions from contigs) and produces a protein, is normally thought of as
795 :     taking place in two steps.
796 :     <b>Transcription</b> is the process of a specific machine
797 :     moving along the contig and making a copy of the
798 :     gene as RNA. This string of RNA is then <b>translated</b>
799 :     by a separate machine. The machine that performs
800 :     the copying of the gene into a string of RNA is called an <b>RNA
801 :     polymerase</b>. The machine to translate
802 :     the RNA into a protein, the <b>ribosome</b>, is made up of
803 :     both proteins and RNA components.
804 :     <p>Machines can be made up of both protein and RNA components,
805 :     although most machines are built from
806 :     just proteins. Some of the most fundamental questions in biology relate
807 :     to how life started and the steps
808 :     required to gradually enrich the basic machinery to the point where
809 :     this magnificent information storage and
810 :     maintenance system based on DNA, RNA and proteins could have arisen.
811 :     There is much that can be inferred by
812 :     reasoning back from what we now observe and reasoning forward from the
813 :     relatively little we know of what the early earth was like. One
814 :     possible set of goals would be to first understand in detail the
815 :     inventory
816 :     of components we now see in life forms, composing something analogous
817 :     to a CAD/CAM system describing life forms.
818 :     Then, as a second step, to understand the sequence of transformations
819 :     that led from some initial raw components
820 : overbeek 1.4 to initial life forms to those we have seen and characterized.
821 : overbeek 1.5 </p>
822 :     <p>The need to allow occasional "nonstandard" characters in
823 :     protein sequences and a loosening of the correspondence
824 :     between a gene and characters in the protein sequence it can be used to
825 :     build results from the fact that
826 :     evolution has produced the existing genetic codes and they continue to
827 :     evolve (either converging or diverging
828 :     depending on the outcome of basically random processes operating under
829 :     selective pressure).
830 : overbeek 1.4 <br>
831 : overbeek 1.5 </p>
832 : overbeek 1.4 <h2>Notes on the Abstraction Extended to Support Regulation</h2>
833 : overbeek 1.5 There are two basically different regulatory mechanisms in the cell. In
834 :     one, you have a metabolic
835 :     network in which fluxes are tightly controlled by positive and negative
836 :     feedback loops. This <b>metabolic
837 :     regulation</b> occurs very rapidly. <b>Transcriptional
838 :     regulation</b> occurs orders of magnitude more slowly. It is just
839 :     this transcriptional regulation that we consider in this extension.
840 :     <p>As the cell changes state, regulons are activated or
841 :     de-activated by
842 : overbeek 1.4 transcriptional regulators (either protein or RNA) binding to specific
843 : overbeek 1.5 sites in the DNA. This model has the redeeming characteristic of
844 :     simplicity. It is certainly the case that there are innumerable
845 : overbeek 1.4 important issues that it disregards (e.g., regulation based on DNA
846 :     packaging or on small RNAs binding the RNAs produced by
847 : overbeek 1.5 transcription, etc.). In forming any clear notion of transcriptional
848 : overbeek 1.4 regulation and how it is achieved, we will need to carefully separate
849 :     these different mechanisms, since they have fundamentally different
850 : overbeek 1.5 modes of control and operation. We are arguing that the notion of a
851 : overbeek 1.4 protein or RNA being used to flip regulons on and off by binding to
852 :     control sites within the genome is a major form of regulation and
853 :     probably the right place to start any effort to formulate a useful
854 :     abstraction.
855 : overbeek 1.5 </p>
856 :     <h1>The Role of Bioinformatics in Supporting the Genomic
857 :     Revolution</h1>
858 :     Within the growing genomics revolution, one can easily divide
859 :     developments and
860 :     goals into those relating to advances in medicine and agriculture and
861 :     those relating to
862 :     pure science. Here we consider only issues relating to pushing advances
863 :     in basic research.
864 : overbeek 1.4 Here is an overview of our perspective:
865 :     <ol>
866 : overbeek 1.5 <li> The different life forms that now exist were produced by
867 :     an evolutionary process,
868 :     which leads to our view that comparative analysis is the key to
869 :     understanding. Biological
870 :     machines that exist in complex forms will often also still exist in
871 :     simpler forms (usually
872 : overbeek 1.4 in simpler organisms).
873 : overbeek 1.5 </li>
874 :     <li> Unravelling exactly how a machine works is more easily
875 :     done in simpler organisms. They
876 :     are easier to work with, and it is easier to gather the data needed to
877 :     support comparative analysis.
878 :     </li>
879 :     <li> This leads to the view that we should try to understand
880 :     single-celled organisms to lay
881 : overbeek 1.4 the foundation for analysis of multicellular organisms.
882 : overbeek 1.5 </li>
883 :     <li> The characterization of unicellular life will require
884 :     access to orders of magnitude
885 :     more data than exist now (we have more-or-less complete sequences for
886 :     about 1000 genomes, but
887 :     that represents a small fraction of a percent of extant single-celled
888 :     life forms).
889 :     </li>
890 :     <li> The immediate basic steps that are taking place are
891 :     roughly:
892 :     <br>
893 :     <br>
894 : overbeek 1.4 <ol>
895 : overbeek 1.5 <li> Attempt to formulate a growing list of abstract
896 :     machines that correspond
897 :     to the many specific machines that implement the same goal. These
898 :     abstract machines (subsystems)
899 : overbeek 1.4 represent the basic units that make up life forms.
900 : overbeek 1.5 </li>
901 :     <li> Create protein and RNA families in which the members
902 :     are all homologous (share a common ancestor),
903 :     remain similar over almost all of the sequence, and all implement a
904 :     common function.
905 :     </li>
906 :     <li> Build alignments for each protein family, along with
907 :     phylogenetic trees that represent
908 : overbeek 1.4 an estimate of the history of how these specific sequences evolved.
909 : overbeek 1.5 </li>
910 :     <li>Provide a computational framework to support continued
911 :     maintenance and development of these
912 :     basic data types.</li>
913 : overbeek 1.4 </ol>
914 : overbeek 1.5 <br>
915 :     Groups are now actively pursuing all of these goals. &nbsp;For
916 :     individuals wishing to build a research program, we suggest
917 :     collaborating with an existing group or moving to one of the newer
918 :     areas that are now emerging.
919 :     </li>
920 :     <br>
921 :     <li> A limited number of groups have progressed to the point
922 :     where they can create models of an organism that display predictive
923 :     capabilities. There are many forms of modeling. In our view
924 :     it is important that we reach the state where we can routinely model
925 :     states of the cell, transitions
926 :     between states, and metabolic characteristics of the cell. We believe
927 :     that it is now possible
928 :     to create fairly comprehensive representations of the metabolic
929 :     networks of some bacteria. In these cases, we have substantial amounts
930 :     of physiological data, the number of abstract machines
931 :     in the cell is fairly limited, and it is possible to compare the
932 :     predictions against observed results. &nbsp; An effort has begun by
933 :     a
934 :     team within the SEED project, led by researchers from Hope College, to
935 :     develop a library of what they call&nbsp;<span style="font-style: italic;">scenarios</span>.
936 :     &nbsp;These scenarios capture the idea of a specific machine
937 :     implementing a metabolic transformation operating with well-defined
938 :     inputs and outputs. From a large and growing number of scenarios in
939 :     this library, they automatically reconstruct metabolic networks for
940 :     most of the bacteria for which genomes have been sequenced.
941 :     &nbsp;This
942 :     effort is setting the stage for widespread whole genome metabolic
943 :     modeling.&nbsp;</li>
944 :     <li>Rapid progress has been made in our ability to
945 :     recognize regulatory binding sites and to use them with knowledge of
946 :     specific machines to create a consistent picture of regulons in some
947 :     bacteria. &nbsp;This technology has been gathering adherents over
948 :     the
949 :     last five years and we believe that it will play a significant role in
950 :     clarifying regulons, additional proteins that will be added to specific
951 :     machines, and a growing understanding of states of the cell.&nbsp;</li>
952 : overbeek 1.4 </ol>
953 : overbeek 1.5 Having said all that, is it possible to list some of the
954 :     important, high-payout bioinformatic questions that are worth
955 :     pondering? &nbsp;Here is a list for your consideration:<br>
956 :     <br>
957 :     <ol>
958 :     <li>The
959 :     definition of the location of genes&nbsp; for bacterial genomes
960 :     needs cleaning up. &nbsp;The situation is made somewhat more
961 :     interesting by a growing use of sequencing technologies that produce
962 :     systematic errors leading to numerous frameshifts and poorly called
963 :     start locations. &nbsp;Fixing these would be a problem of modest
964 :     difficulty and very modest reward. &nbsp;The situation in
965 :     eukaryotic
966 :     genomes is quite different. &nbsp;The problem of defining the genes
967 :     in
968 :     a eukaryotic genome is still quite unsolved. &nbsp;We conjecture
969 :     that</li>
970 : overbeek 1.4 <ul>
971 : overbeek 1.5 <li>the
972 :     key to progress is the use of sets of genomes (i.e., solve the problem
973 :     of defining the genes in a set of closely-related genomes first), and</li>
974 :     <li>one should begin
975 :     with the single-celled eukaryotic genomes. &nbsp;There are
976 :     many
977 :     types of single-celled eukaryotes, and some of them will undoubtedly
978 :     offer major challenges. &nbsp;However, existing experience suggests
979 :     that there will be numerous <span style="font-style: italic;">fungal</span>
980 :     genomes available (for example) and that focusing on these would be a
981 :     much easier task than trying to face plants, animals, etc.</li>
982 :     </ul>
983 :     <li>The
984 :     creation of populated subsystems is essentially a task for expert
985 :     biologists. &nbsp;However, the tools to support the task are a
986 :     reasonable focus for bioinformatic projects. &nbsp;The tools needed
987 :     to
988 :     delicately separate the roles of paralogous proteins have been
989 :     illustrated in the works of Jensen and Bonner, among others.
990 :     &nbsp;These tools relate to use of alignments, trees and motifs to
991 :     define the decision procedures needed to classify proteins into one of
992 :     several closely-related families.</li>
993 :     <li>The development &nbsp;of a
994 :     self-consistent set of protein families is a task closely related to
995 :     the one above. &nbsp;Several major
996 :     efforts are currently building such protein families. &nbsp;The
997 :     development
998 :     of protocols for maintenance of the families, studying the evolutionary
999 :     history of related families, development of motifs that characterize
1000 :     specific families, and so forth all represent parts of a large
1001 :     classification problem.</li>
1002 :     <li>There is a class of tools that attempt to spot <span style="font-style: italic;">functional coupling</span>
1003 :     between specific proteins. &nbsp;Some are bioinformatic (like the
1004 :     chromosomal clustering and fusion phenomena briefly discussed above).
1005 :     &nbsp;Some are essentially experimental data (e.g., protein-protein
1006 :     interaction data or microarray data). &nbsp;The integration of
1007 :     evidence
1008 :     into a system capable of predicting whether &nbsp;or not two
1009 :     specific
1010 :     proteins are both components of a single machine has been attempted,
1011 :     but much more remains to be done. &nbsp;The closely-related problem
1012 :     of
1013 :     determining whether or not two protein families are <span style="font-style: italic;">functionally coupled</span>
1014 :     (and precisely what that means) should be considered simultaneously.</li>
1015 :     <li>Defining
1016 :     regulons by gradually composing a consistent interpretation of
1017 :     subsystems, regulatory sites, and physiological data is a task that is
1018 :     semi-automated. &nbsp;Development of a fully automated version seems
1019 :     too
1020 :     ambitious, but developing tools to increase the productivity of
1021 :     biologists developing these models of transcriptional regulation is
1022 :     certainly going to gain much more attention.</li>
1023 :     <li>Development of a meaningful notion of <span style="font-style: italic;">states of a cell</span>
1024 :     is a problem that seems to us to have many of the characteristics one wants:
1025 :     &nbsp;it is a problem for which relevant data is starting to
1026 :     appear,
1027 :     many aspects of the needed infrastructure have only recently appeared,
1028 :     and the outcome may be of fundamental significance.</li>
1029 :     <li>To what
1030 :     extent is it possible to predict the protein families which have
1031 :     instances in a given cell given the closest 10 neighboring genomes and
1032 :     detailed information on the families they contain?</li>
1033 :     <li>Is it possible to think of a set of protein families as <span style="font-style: italic;">major predictors</span>
1034 :     that would allow you to infer the presence or absence of many other
1035 :     families?</li>
1036 : overbeek 1.4 </ol>
1037 : overbeek 1.5 <br>
1038 :     <ul>
1039 : overbeek 1.4 </ul>
1040 : overbeek 1.5 <h1> The Role of Abstraction in Setting the Stage for Software
1041 :     Development and Modeling</h1>
1042 :     In
1043 :     this section, we argue that the abstraction is much more than just a
1044 :     pedagogical aid. &nbsp;It will form the conceptual under-pinnings
1045 :     of
1046 :     the software needed to support work on the problems described in the
1047 :     last section (as well as numerous others that will become apparent as
1048 :     the revolution progresses).<br>
1049 :     </body></html>
