Annotation of /FigTutorial/tut_abs.html

Revision 1.6

1 : overbeek 1.5 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
2 : overbeek 1.6 <html><head><title>Abstraction Working Document</title></head>
3 : overbeek 1.5 <body>
4 :     <div align="center">
5 : overbeek 1.6 <h1>Understanding Single-celled Life:</h1>
6 :     <h1>An Abstract Approach</h1>
7 :     <h2>by Ralph Butler, Ross Overbeek, ...</h2>
8 : overbeek 1.1 </div>
9 : overbeek 1.6 <br>
10 :     <h1>Part 1: The Cell: a Basic Abstraction</h1>
11 : overbeek 1.5 A <b>cell</b> is a bag (i.e., a volume enclosed by a
12 :     membrane) that contains three types of things: compounds, cellular
13 :     machines, and a genome.
14 :     <p>By the term <b>compound</b> we refer to the
15 :     normal notion of chemical compound. </p>
16 :     <p>A <b>cellular machine</b> is a set of proteins
17 :     that together perform a function. Unless otherwise noted,
18 :     when we use the term <i>machine</i> we will always be
19 :     speaking of a cellular machine.
20 : overbeek 1.3 Many machines
21 : overbeek 1.5 transform one set of compounds into another set. Some machines
22 :     (transport machines) are used to move compounds into
23 :     or out of the cell. Later we will try to convey a more comprehensive
24 :     notion of what functions are implemented
25 : overbeek 1.3 by machines that we understand.
26 : overbeek 1.5 </p>
27 :     <p>A <b>protein</b> is a string of amino acids
28 :     (i.e., a string in the 20-character alphabet
29 :     {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}).
30 :     </p>
31 :     <p>A <b>genome</b> is a string of DNA bases (i.e., a
32 :     string in the 4-character alphabet {A,C,G,T}).
33 :     </p>
34 :     <p>A <b>gene</b> is a region in the genome that
35 :     describes how to build a
36 :     protein. The description is a sequence of 3-character codons. Each
37 : overbeek 1.6 <span style="font-style: italic;">codon</span> may
38 :     be thought of as an
39 :     instruction specifying which amino acid should come next in the protein
40 :     the gene describes. &nbsp; Thus, if the protein described by the
41 :     gene
42 :     contains 100 amino acids, then the gene would be composed of 100 codons
43 :     (i.e., 300 DNA characters) followed by a codon that means "stop here"
44 :     (a <span style="font-style: italic;">stop codon</span>).&nbsp;
45 : overbeek 1.5 There are three stop codons: {TAA,TAG,TGA}. The genetic code is the
46 : overbeek 1.1 table of correspondences between codons and amino acids:
47 : overbeek 1.5 <br>
48 :     <br>
49 :     <table border="1">
50 :     <tbody>
51 :     <tr>
52 :     <th>Amino Acid</th>
53 :     <th>Codons</th>
54 :     </tr>
55 :     <tr>
56 :     <td>A</td>
57 :     <td>GCT, GCC, GCA, GCG </td>
58 :     </tr>
59 :     <tr>
60 :     <td>C</td>
61 :     <td>TGT, TGC</td>
62 :     </tr>
63 :     <tr>
64 :     <td>D</td>
65 :     <td>GAT, GAC</td>
66 :     </tr>
67 :     <tr>
68 :     <td>E</td>
69 :     <td>GAA, GAG</td>
70 :     </tr>
71 :     <tr>
72 :     <td>F</td>
73 :     <td>TTT, TTC</td>
74 :     </tr>
75 :     <tr>
76 :     <td>G</td>
77 :     <td>GGT, GGC, GGA, GGG</td>
78 :     </tr>
79 :     <tr>
80 :     <td>H</td>
81 :     <td>CAT, CAC</td>
82 :     </tr>
83 :     <tr>
84 :     <td>I</td>
85 :     <td>ATT, ATC, ATA</td>
86 :     </tr>
87 :     <tr>
88 :     <td>K</td>
89 :     <td>AAA, AAG</td>
90 :     </tr>
91 :     <tr>
92 :     <td>L</td>
93 :     <td>TTA, TTG, CTT, CTC, CTA, CTG</td>
94 :     </tr>
95 :     <tr>
96 :     <td>M</td>
97 :     <td>ATG</td>
98 :     </tr>
99 :     <tr>
100 :     <td>N</td>
101 :     <td>AAT, AAC</td>
102 :     </tr>
103 :     <tr>
104 :     <td>P</td>
105 :     <td>CCT, CCC, CCA, CCG</td>
106 :     </tr>
107 :     <tr>
108 :     <td>Q</td>
109 :     <td>CAA, CAG</td>
110 :     </tr>
111 :     <tr>
112 :     <td>R</td>
113 :     <td>CGT, CGC, CGA, CGG, AGA, AGG</td>
114 :     </tr>
115 :     <tr>
116 :     <td>S</td>
117 :     <td>TCT, TCC, TCA, TCG, AGT, AGC</td>
118 :     </tr>
119 :     <tr>
120 :     <td>T</td>
121 :     <td>ACT, ACC, ACA, ACG</td>
122 :     </tr>
123 :     <tr>
124 :     <td>V</td>
125 :     <td>GTT, GTC, GTA, GTG</td>
126 :     </tr>
127 :     <tr>
128 :     <td>W</td>
129 :     <td>TGG</td>
130 :     </tr>
131 :     <tr>
132 :     <td>Y</td>
133 :     <td>TAT, TAC</td>
134 :     </tr>
135 :     <tr>
136 :     <td>*</td>
137 :     <td>TAG, TGA, TAA [Stop codons]</td>
138 :     </tr>
139 :     </tbody>
140 : overbeek 1.1 </table>
141 : overbeek 1.5 <br>
142 :     <br>
143 :     </p>
144 :     <hr>The process of building a protein as a string of amino acids
145 :     from the gene containing codons is
146 : overbeek 1.4 called <b>expressing</b> the gene.
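<p>To make the notion of expressing a gene concrete, here is a minimal sketch in Python. It transcribes the genetic code table above into a dictionary and reads a gene one codon at a time until it reaches a stop codon. The names <code>AA_TO_CODONS</code>, <code>CODON_TO_AA</code> and <code>express</code> are our own choices for illustration, and the example gene is simply the first ten codons of a real gene followed by a stop codon that we appended by hand.</p>

```python
# Genetic code transcribed from the table above: amino acid -> codons.
# "*" stands for the three stop codons.
AA_TO_CODONS = {
    "A": "GCT GCC GCA GCG", "C": "TGT TGC", "D": "GAT GAC", "E": "GAA GAG",
    "F": "TTT TTC", "G": "GGT GGC GGA GGG", "H": "CAT CAC", "I": "ATT ATC ATA",
    "K": "AAA AAG", "L": "TTA TTG CTT CTC CTA CTG", "M": "ATG", "N": "AAT AAC",
    "P": "CCT CCC CCA CCG", "Q": "CAA CAG", "R": "CGT CGC CGA CGG AGA AGG",
    "S": "TCT TCC TCA TCG AGT AGC", "T": "ACT ACC ACA ACG",
    "V": "GTT GTC GTA GTG", "W": "TGG", "Y": "TAT TAC", "*": "TAG TGA TAA",
}

# Invert the table so that each codon maps to its amino acid.
CODON_TO_AA = {
    codon: aa for aa, codons in AA_TO_CODONS.items() for codon in codons.split()
}


def express(gene):
    """Translate a gene codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(gene) - 2, 3):
        amino_acid = CODON_TO_AA[gene[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Ten codons followed by the stop codon TAA (appended for illustration).
print(express("ATGAAACTCTACAATCTGAAAGATCACAACTAA"))  # MKLYNLKDHN
```

<p>A real cell does considerably more than this when it expresses a gene; the sketch covers only the abstraction used in this document.</p>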
147 :     <br>
148 : overbeek 1.6 <br>
149 :     <h2>Problems in BioInformatics that Depend only on the Basic
150 :     Abstraction</h2>
151 :     <h4>Identifying Genes within the Genome</h4>
152 :     If we plan on using a genome, it will usually be necessary to identify
153 :     the genes within the genome. &nbsp;How can this best be done?
154 :     &nbsp; First, it should be noted that this can be broken into three
155 :     variations:<br>
156 :     <br>
157 :     <ol>
158 :     <li>Given no assumption of an existing body of previously
159 :     identified genes, find the genes in a new genome.</li>
160 :     <li>Given a large collection of existing genomes in which the
161 :     genes have been identified, find the set of genes in a new genome.</li>
162 :     <li>Given a large set of existing genomes, discard any existing
163 :     decisions and try to identify genes in all of them from scratch.</li>
164 :     </ol>
165 :    
166 :     When the first genome was sequenced, the first option was pretty much
167 :     the only reasonable choice (this is not completely true, since we had
168 :     many partial genomes that had already been sequenced and annotated).
169 :     People focused on developing reasonable strategies that would make the
170 :     best possible choices taking just the single genome as input.
171 :     <p>
172 :     Very quickly, the second alternative became more appropriate; it was
173 :     based on the idea of effectively exploiting the efforts that had been
174 :     expended in the early genomes to more quickly and accurately identify
175 :     the genes in each new genome.
176 :     <p>
177 :     It is worth noting that the second approach, while exploiting the
178 :     investments made in annotating the early genomes, also has the
179 :     property that early errors are frequently propagated. If an
180 :     algorithm had called a section of an early genome a gene when it actually
181 :     was not, then when we see something similar in a new genome it might
182 :     well get improperly labeled as well.
183 :     <p>
184 :     The third approach offers an unusual perspective and opportunity. It
185 :     suggests that we are entering an era in which we have many available
186 :     genomes, and that there might be approaches based on comparison that
187 :     would support more accurate annotations for the entire collection.
188 :     There may be many such approaches, but we will describe just one that
189 :     is based on ideas used in creating one of the early gene-calling
190 :     systems. Let us start by quoting the abstract from
191 :     <b>CRITICA: coding region identification tool invoking comparative
192 :     analysis.</b> by Jonathan Badger and Gary Olsen (Mol Biol Evol. 1999
193 :     Apr;16(4):512-24.PMID: 10331277):
194 :    
195 :     <blockquote>
196 :     "Gene recognition is essential to understanding existing and future
197 :     DNA sequence data. CRITICA (Coding Region Identification Tool
198 :     Invoking Comparative Analysis) is a suite of programs for identifying likely protein coding sequences in DNA by combining comparative analysis of DNA sequences with more common noncomparative methods. In the comparative component of the analysis,
199 :     regions of DNA are aligned with related sequences from the DNA
200 :     databases; if the translation of the aligned sequences has greater
201 :     amino acid identity than expected for the observed percentage nucleotide identity, this is interpreted as evidence for coding. CRITICA also incorporates noncomparative information derived from
202 :     the relative frequencies of hexanucleotides in coding-frames versus
203 :     other contexts (i.e., dicodon bias). The dicodon usage information
204 :     is derived by iterative analysis of the data so that CRITICA is not
205 :     dependent upon the existence or accuracy of coding sequence annotations in the databases. This independence makes the method
206 :     particularly well-suited for the analysis of novel genomes. CRITICA was tested by analyzing the available Salmonella typhimurium
207 :     DNA sequences. Its predictions were compared to the DNA sequence annotations and to the predictions of GenMark. CRITICA
208 :     proved more accurate than GenMark, and, moreover, many of its
209 :     predictions that would seem to be errors, instead reflect problems
210 :     in the sequence databases."
211 :     </blockquote>
212 :    
213 :     To understand the basic idea, we need to discuss how genomes are
214 :     passed on to descendants. We discuss the notion of replication below,
215 :     but for now let us just say that cells occasionally copy their genome
216 :     and divide into two cells, leaving a version of the genome in each
217 :     cell. The set of machines in the original cell also gets divided.
218 :     How the cell makes sure that each of the new cells gets enough
219 :     machines to make up an operational life-form is a separate topic. For
220 :     now, let us just say that they do achieve it. The new cell containing
221 :     a copy of the genome that existed in the original cell may very
222 :     occasionally contain a copied genome that differs from the original version due to errors in
223 :     copying. These differences are called <i>mutations</i>. If a
224 :     mutation occurred in a gene (encoding a protein), and if the mutation
225 :     caused the encoding to be changed to produce a protein sequence that
226 :     would not work, then the mutation is <i>lethal</i> and the cell dies
227 :     (whatever that means -- something close to "it does not function well
228 :     enough to compete for resources"). On the other hand, a mutation may
229 :     change the encoding so that the new version is just as good, or even
230 :     better. Many mutations simply change the DNA, but not the
231 :     protein it is used to generate (e.g., one might change <b>GGC</b> to <b>GGA</b>, both
232 :     of which encode the amino acid <b>G</b>).
233 :     <p>
234 :     Most mutations that occur in protein-encoding genes are lethal (the
235 :     proteins have been optimized over many, many generations). Relatively
236 :     few mutations improve the functioning of the encoded protein.
237 :     This means that most mutations that alter which amino acid is
238 :     encoded do not appear in the sequenced genomes (cells with those
239 :     mutations often just die). A disproportionate number of the mutations
240 :     we do observe leave the encoded sequence of amino acids
241 :     unchanged.
242 :     <p>
243 :     Let us make this more concrete, so that you can tie
244 :     these notions together. We begin with a <b>multiple-sequence
245 :     alignment</b> of the starts of some genes from closely-related cells:
246 :     <pre>
247 :    
248 :     fig|198214.1.peg.4 ATGAAACTCTACAATCTGAAAGATCACAACGAGCAGGTCAGCTTTGCGCAAGCCGTAACC
249 :     fig|83333.1.peg.4 ATGAAACTCTACAATCTGAAAGATCACAACGAGCAGGTCAGCTTTGCGCAAGCCGTAACC
250 :     fig|331112.3.peg.3 ATGAAACTCTACAATCTGAAAGATCACAACGAGCAGGTCAGCTTTGCGCAAGCCGTAACC
251 :     fig|155864.1.peg.4 ATGAAACTCTACAATCTTAAAGATCACAATGAGCAGGTCAGCTTTGCGCAAGCCGTAACC
252 :     fig|321314.4.peg.144 ATGAAACTCTATAATCTGAAAGACCATAATGAGCAGGTCAGCTTTGCGCAGGCCGTCACG
253 :     *********** ***** ***** ** ** ******************** ***** **
254 :    
255 :     fig|198214.1.peg.4 CAGGGGTTGGGCAAAAATCAGGGGCTGTTTTTTCCGCACGACCTGCCGGAATTCAGCCTG
256 :     fig|83333.1.peg.4 CAGGGGTTGGGCAAAAATCAGGGGCTGTTTTTTCCGCACGACCTGCCGGAATTCAGCCTG
257 :     fig|331112.3.peg.3 CAGGGGTTGGGCAAAAATCAGGGGCTGTTTTTTCCGCATGACCTGCCGGAATTCAGCCTG
258 :     fig|155864.1.peg.4 CAGGGGTTGGGCAAAAATCAGGGGCTGTTTTTTCCGCACGACCTGCCGGAATTCAGCCTG
259 :     fig|321314.4.peg.144 CAAGGACTGGGCAAACAGCAGGGACTTTTTTTTCCGCACGAACTGCCGGAGTTTAGCCTG
260 :     ** ** ******** * ***** ** *********** ** ******** ** ******
261 :     </pre>
262 :    
263 :     We are depicting the initial 120 characters of the DNA encoding the
264 :     corresponding protein from 5 distinct cells. We have associated
265 :     distinct identifiers with the 5 genes (e.g., fig|198214.1.peg.4). Each
266 :     of the genes begins with <b>ATG</b>, which is a codon encoding
267 :     <b>M</b>. The corresponding amino acid strings (that is, the starts
268 :     of the proteins encoded by the genes) are as follows:
269 :    
270 :    
271 :     <pre>
272 :     fig|198214.1.peg.4 MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSL
273 :     fig|331112.3.peg.3 MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSL
274 :     fig|83333.1.peg.4 MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSL
275 :     fig|155864.1.peg.4 MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSL
276 :     fig|321314.4.peg.144 MKLYNLKDHNEQVSFAQAVTQGLGKQQGLFFPHELPEFSL
277 :     ************************* ******* ******
278 :     </pre>
279 :    
280 :     Note that we have 120 DNA characters encoding 40 amino acids in each
281 :     of 5 closely-related genomes. Note that the fourth codon in the gene
282 :     (TAT in one genome, but TAC in the others) corresponds to the <b>Y</b>
283 :     in the fourth position of the amino acid alignment.
284 :     We highly recommend that you manually
285 :     go through the correspondence between the DNA and amino acid
286 :     sequences. Tabulate the number of mutations that did not alter the
287 :     amino acid sequences, as well as the number that did. Think about
288 :     what this means. It is critical.
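<p>If you would like to check your manual tabulation, the following Python sketch compares two of the genes above (fig|198214.1.peg.4 and fig|321314.4.peg.144) codon by codon. It counts the codons that differ without changing the amino acid and those that differ and do change it. The function name <code>classify_codon_changes</code> and the compact way of building the codon table are our own choices; the sequences are copied from the DNA alignment.</p>

```python
from itertools import product

# Standard genetic code, built in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# First 120 characters of two genes from the alignment above.
GENE_198214 = (
    "ATGAAACTCTACAATCTGAAAGATCACAACGAGCAGGTCAGCTTTGCGCAAGCCGTAACC"
    "CAGGGGTTGGGCAAAAATCAGGGGCTGTTTTTTCCGCACGACCTGCCGGAATTCAGCCTG"
)
GENE_321314 = (
    "ATGAAACTCTATAATCTGAAAGACCATAATGAGCAGGTCAGCTTTGCGCAGGCCGTCACG"
    "CAAGGACTGGGCAAACAGCAGGGACTTTTTTTTCCGCACGAACTGCCGGAGTTTAGCCTG"
)


def classify_codon_changes(gene_a, gene_b):
    """Return (silent, altering) counts over the codons that differ."""
    silent = altering = 0
    for i in range(0, len(gene_a), 3):
        codon_a, codon_b = gene_a[i:i + 3], gene_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TO_AA[codon_a] == CODON_TO_AA[codon_b]:
            silent += 1      # DNA changed, amino acid did not
        else:
            altering += 1    # DNA changed and so did the amino acid
    return silent, altering


silent, altering = classify_codon_changes(GENE_198214, GENE_321314)
print(silent, altering)  # 14 2
```

<p>Note that the sketch counts changed codons, not changed characters: a codon in which two characters differ is counted once.</p>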
289 :     <p>
290 :     What is important for you to realize is that the authors of CRITICA
291 :     had a pretty good idea: with just these five genomes you can rather
292 :     reliably recognize that these regions encode amino acid strings. If
293 :     we were to take the 30 characters ahead of the genes (usually called
294 :     <i>upstream</i> of the genes) along with the initial ATG we would get
295 :     the following alignment of those DNA sequences:
296 :    
297 :     <pre>
298 :    
299 :     fig|198214.1.peg.4 ACGGCGGGCGCACGAGTACTGGAAAACTAAATG
300 :     fig|331112.3.peg.3 ACGGCGGGCGCACGAGTACTGGAAAACTAAATG
301 :     fig|83333.1.peg.4 ACGGCGGGCGCACGAGTACTGGAAAACTAAATG
302 :     fig|155864.1.peg.4 ACGGCGGGCGCACGAGTACTGGAAAACTAAATG
303 :     fig|321314.4.peg.144 ACGGCGGGCGCACGAGTAGTGGGATAATCAATG
304 :     ****************** *** * * * ****
305 :     </pre>
306 :    
307 :     When we look at the generated amino acids, we see
308 :    
309 :     <pre>
310 :    
311 :     fig|198214.1.peg.4 TAGARVLENXM
312 :     fig|331112.3.peg.3 TAGARVLENXM
313 :     fig|83333.1.peg.4 TAGARVLENXM
314 :     fig|155864.1.peg.4 TAGARVLENXM
315 :     fig|321314.4.peg.144 TAGARVVGXSM
316 :     ******: *
317 :     </pre>
318 :    
319 :     Here we see some <b>X</b>s in the alignment; they represent <i>stop
320 :     codons</i> (i.e., they indicate that the codon does not encode an
321 :     amino acid). What is worth noting is that there are mutations in 5 of
322 :     the 30 upstream characters, and they fall in 4 codons, every one of
323 :     which changed its encoded character. It is a fact that most genes begin with
324 :     <b>ATG</b>, which makes it quite likely that this gene begins with the
325 :     exact <b>ATG</b> we have shown.
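<p>The contrast between the coding region and the upstream region can be put into numbers. The Python sketch below is only an illustration of the comparative idea, not CRITICA itself: for a pair of aligned DNA strings it reports the fraction of identical characters and the fraction of identical translated codons. The function name <code>identities</code> is ours, and we assume the two strings have equal length, contain no dashes, and start on a codon boundary.</p>

```python
from itertools import product

# Standard genetic code, built in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def identities(dna_a, dna_b):
    """Return (nucleotide identity, translated identity) as fractions.

    Assumes equal-length, gap-free strings that begin on a codon boundary.
    A stop codon is treated as one more symbol when comparing translations.
    """
    same_nt = sum(x == y for x, y in zip(dna_a, dna_b))
    codon_starts = range(0, len(dna_a), 3)
    same_aa = sum(
        CODON_TO_AA[dna_a[i:i + 3]] == CODON_TO_AA[dna_b[i:i + 3]]
        for i in codon_starts
    )
    return same_nt / len(dna_a), same_aa / len(codon_starts)


# First 60 characters of the gene in fig|198214.1.peg.4 and fig|321314.4.peg.144.
coding = identities(
    "ATGAAACTCTACAATCTGAAAGATCACAACGAGCAGGTCAGCTTTGCGCAAGCCGTAACC",
    "ATGAAACTCTATAATCTGAAAGACCATAATGAGCAGGTCAGCTTTGCGCAGGCCGTCACG",
)
# The 30 characters upstream of the same two genes.
upstream = identities(
    "ACGGCGGGCGCACGAGTACTGGAAAACTAA",
    "ACGGCGGGCGCACGAGTAGTGGGATAATCA",
)
print(coding, upstream)
```

<p>For these inputs the coding pair shares 53 of 60 DNA characters and all 20 translated characters, while the upstream pair shares 25 of 30 DNA characters but only 6 of 10 translated characters. Similar DNA identity with much higher translated identity is the kind of signal the comparative approach looks for.</p>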
326 :     <p>
327 :     Now let us return to the topic of gene-calling. Our basic approach
328 :     will be as follows:
329 :     <ol>
330 :     <li>Begin by attempting to find as many genes as we can by taking the
331 :     existing set of genomes and finding protein-encoding sections using
332 :     the idea that was used in CRITICA. This will be computationally
333 :     expensive because it might require looking for similar regions in
334 :     thousands of genomes (remember there are 499,500 pairwise comparisons
335 :     to make for 1000 genomes, and there are thousands of similar genes for
336 :     almost all of the pairwise comparisons). However, what we will get
337 :     out is a pretty accurate estimate of which areas of each genome are
338 :     actually genes.
339 :     <li>A second step, after forming as many accurate predictions as we
340 :     can make, would involve polishing things up by taking the set of
341 :     predicted genes and trying to
342 : overbeek 1.4 <ul>
343 : overbeek 1.6 <li> make sure the details are consistent (e.g., that the start
344 :     positions of corresponding genes seem to be the same) and
345 :     <li> that we have not missed any distantly related, short genes.
346 :     </ul>
347 :     </ol>
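<p>The count of pairwise comparisons quoted in the first step is the number of ways of choosing 2 genomes out of n, that is, n(n-1)/2. A short Python check (the function name <code>pairwise_comparisons</code> is ours):</p>

```python
import math


def pairwise_comparisons(genome_count):
    """Number of unordered pairs of genomes: n * (n - 1) / 2."""
    return math.comb(genome_count, 2)


print(pairwise_comparisons(1000))  # 499500
```

<p>Because the count grows quadratically, doubling the collection to 2000 genomes roughly quadruples the number of comparisons.</p>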
348 :     A comprehensive re-calling of genes should be done periodically,
349 :     leading to ever more reliable estimates for ever more thousands of
350 :     genomes. The whole topic of reducing the effort required to do the
351 :     incremental comparisons between genomes is obviously going to be
352 :     considered over the coming years. What is important, we suppose, is
353 :     that it is clear at this point that we can now accurately call genes
354 :     in prokaryotic genomes (although no one has yet gone back and cleaned
355 :     up all of the errors in the existing genomes).
356 :    
357 :    
358 :     <h4>Identifying Similar Genes</h4>
359 :     Genes are said to be <span style="font-style: italic;">homologous</span>
360 :     if they share a common ancestor. &nbsp;Tools have been developed to
361 :     construct estimates of whether or not two genes, or the protein
362 :     sequences they encode, are homologous. &nbsp;Most of these are
363 :     based on measuring the degree of <span style="font-style: italic;">similarity</span>
364 :     between the genes based on some metric. &nbsp;The most basic
365 :     versions of this problem are<br>
366 :     <br>
367 : overbeek 1.4 <ol>
368 : overbeek 1.6 <li>Given two genes (or proteins), are they homologs?
369 :     &nbsp;That is, estimate the likelihood that they are homologs.</li>
370 :     <li>Given a gene and a database of other genes, extract a
371 :     prioritized list from the database of genes that are likely to be
372 :     homologs. &nbsp;Similarly, given a protein sequence and a database
373 :     of other protein sequences, which are most likely to be produced by
374 :     homologous genes?</li>
375 :     <li>Produce an <span style="font-style: italic;">alignment</span>
376 :     of two DNA or protein sequences that attempts to show corresponding
377 :     characters in the two sequences. &nbsp; For example,<br>
378 : overbeek 1.5 </li>
379 : overbeek 1.4 </ol>
380 : overbeek 1.6 <pre>
381 :     fig|226900.1.peg.4136 ------------------ATGAGTAAAATTATCGGTATTGACTTAGGTAC
382 :     fig|138677.1.peg.499 ATGAGTGAACACAAAAAATCAAGCAAAATTATAGGTATAGACTTAGGCAC
383 :     ** ******** ***** ******** **
384 :    
385 :     fig|226900.1.peg.4136 AACAAACTCTTGTGTAGCTGTTATGGAAGGTGGAGAACCAAAGGTTATCC
386 :     fig|138677.1.peg.499 AACAAACTCCTGCGTATCTGTTATGGAAGGAGGACAAGCTAAAGTAATTA
387 :     ********* ** *** ************* *** ** * ** ** **
388 :    
389 :     fig|226900.1.peg.4136 CAAATCCAGAAGGGAACCGTACAACACCTTCTGTTGTAGCTTTCAAAAAT
390 :     fig|138677.1.peg.499 CATCATCCGAAGGAACAAGAACCACGCCATCGATCGTTGCCTTCAAAGGT
391 :     ** * ***** * * ** ** ** ** * ** ** ****** *
392 :    
393 :     fig|226900.1.peg.4136 GAAGAACGTCAAGTTGGGGAAGTTGCAAAGCGCCAAGCAATTACAAACCC
394 :     fig|138677.1.peg.499 AATGAGAAATTAGTGGGGATTCCAGCAAAACGTCAAGCAGTGACAAATCC
395 :     * ** *** *** ***** ** ****** * ***** **
396 :    
397 :     fig|226900.1.peg.4136 AAATACAA---TCATGTCTGTTAAACGTCATATGGG---TACAGACTACA
398 :     fig|138677.1.peg.499 AGAAAAAACTCTCGGCTCTACAAAACGCTTTATTGGCCGTAAGTACTCTG
399 :     * * * ** ** *** ***** *** ** ** ***
400 :    
401 :     fig|226900.1.peg.4136 AAGTAG--------------------------------------------
402 :     fig|138677.1.peg.499 AAGTAGCTTCGGAAATCCAAACCGTTCCTTATACAGTCACCTCCGGATCT
403 :     ******
404 :    
405 :     fig|226900.1.peg.4136 -------------------AAGTTGAAGGTAAAGATTATACACCTCAAGA
406 :     fig|138677.1.peg.499 AAAGGTGATGCCGTTTTCGAAGTTGATGGCAAACAATACACTCCAGAAGA
407 :     ******* ** *** * ** ** ** ****
408 :    
409 :     fig|226900.1.peg.4136 AATTTCTGCCATCATTTTACAAAACTTAAAAGCTTCTGCTGAAGCATACT
410 :     fig|138677.1.peg.499 AATTGGCGCACAAATCTTAATGAAAATGAAAGAGACAGCAGAAGCTTATC
411 :     **** ** ** *** ** * **** * ** ***** **
412 :    
413 :     fig|226900.1.peg.4136 TAGGTGAAACAGTAACGAAAGCTGTTATTACAGTACCTGCATACTTCAAC
414 :     fig|138677.1.peg.499 TAGGCGAAACTGTCACAGAAGCAGTGATCACCGTCCCCGCATACTTCAAT
415 :     **** ***** ** ** **** ** ** ** ** ** ***********
416 :    
417 :     fig|226900.1.peg.4136 GATGCAGAGCGTCAAGCAACGAAAGATGCTGGTCGTATCGCTGGTTTAGA
418 :     fig|138677.1.peg.499 GATTCTCAACGAGCATCCACAAAAGATGCTGGACGCATTGCAGGTCTAGA
419 :     *** * * ** * * ** *********** ** ** ** *** ****
420 :    
421 :     fig|226900.1.peg.4136 AGTTGAGCGTATCATTAACGAGCCAACAGCAGCAGCACTTGCTTACGGTT
422 :     fig|138677.1.peg.499 TGTAAAACGTATCATTCCAGAACCTACCGCAGCAGCTCTTGCCTACGGAA
423 :     ** * ********* ** ** ** ******** ***** *****
424 :    
425 :     fig|226900.1.peg.4136 TAGAAAAACAAGACGAAGAACAAAAAATCTTAGTATATGACTTAGGTGGC
426 :     fig|138677.1.peg.499 TCGATAA---AGTCGGTGATAAAAAAATCGCTGTCTTCGACCTTGGTGGA
427 :     * ** ** ** ** ** ******** ** * *** * *****
428 :     </pre>
429 :    
430 :     When two characters are in the same column, the implication is that we
431 :     believe that they derived from the same character in an ancestral
432 :     sequence. When a dash (i.e., a <b>-</b>) appears in a column, it indicates that we
433 :     believe that
434 :     <ul>
435 :     <li>the ancestral sequence had a character which corresponds to a
436 :     character in one of the sequences, but the other sequence lost a
437 :     character in the evolutionary process, or
438 :     <li>
439 :     the ancestral sequence did not have a character in this position, but
440 :     a new one was inserted for one of the two sequences.
441 : overbeek 1.4 </ul>
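<p>To show how such an alignment might be computed, here is a minimal Python sketch of one classic technique, global alignment by dynamic programming (the Needleman-Wunsch algorithm). We do not claim that the alignment shown above was produced this way. The scoring values (<code>match</code>, <code>mismatch</code>, <code>gap</code>) are arbitrary assumptions chosen for illustration; practical tools use substitution matrices and more refined gap penalties.</p>

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Return (aligned_a, aligned_b, score) for a global alignment of a and b."""
    n, m = len(a), len(b)

    def pair_score(i, j):
        # Score for placing a[i-1] and b[j-1] in the same column.
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # same column
                score[i - 1][j] + gap,                   # dash in b
                score[i][j - 1] + gap,                   # dash in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


print(global_align("ACGT", "AGT"))  # ('ACGT', 'A-GT', 1)
```

<p>The table has one cell per pair of prefixes, so time and memory grow with the product of the two sequence lengths. That is fine for two genes but is one reason why database searches rely on faster heuristics.</p>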
442 : overbeek 1.6
443 :     <h4>Multiple-Sequence Alignment</h4>
444 :    
445 :     A multiple-sequence alignment extends the notion of a binary
446 :     alignment. We have already used them in discussing the problem of
447 :     identifying the genes in genomes, but they represent a fundamental
448 :     source of comparative insight and come into play in almost every
449 :     aspect of analyzing genomic sequences.
450 :    
451 :     Consider the following piece of a multiple-sequence alignment:
452 :    
453 :     <pre>
454 :     fig|226900.1.peg.4136 -------------------MSKIIGIDLGTTNSCVAVME-GGEPKVIPNP
455 :     fig|95665.5.peg.505 ----------------------------------MAVIE-NKKPIVLENP
456 :     fig|138677.1.peg.499 -------------MSEHKKSSKIIGIDLGTTNSCVSVME-GGQAKVITSS
457 :     fig|243274.1.peg.368 ---------------MAEKKEFVVGIDLGTTNSVIAWMKPDGTVEVIPNA
458 :     fig|349521.5.peg.4864 MIRKIAVFSFLRANRGFQSSMSLIGIDLGTTNSLIAHWG-EQGVEIIPNR
459 :     fig|397945.5.peg.3653 -----------------MEQKMIIGIDLGTTNSLVAAWK-DGRSVLIPNA
460 :     :: :: .
461 :    
462 :     fig|226900.1.peg.4136 EGNRTTPSVVAFK-NEERQVGEVAKRQAITNPN-TIMSVKRHMG------
463 :     fig|95665.5.peg.505 EGKRTVPSVVSFN-GDEVLVGDAAKRKQITNPN-TVSSIKRLMG------
464 :     fig|138677.1.peg.499 EGTRTTPSIVAFK-GNEKLVGIPAKRQAVTNPEKTLGSTKRFIGRKYSEV
465 :     fig|243274.1.peg.368 EGSRVTPSVVAFTKSGEILVGEPAKRQMILNPERTIKSIKRKMG------
466 :     fig|349521.5.peg.4864 LGARLTPSAVSLDADGAVIVGQAAKDRLVTHPDLSVASFKRRMG------
467 :     fig|397945.5.peg.3653 LGETLTPSCVSLDEDVTVLVGRAARERLQTHPDRTAANFKRYMG------
468 :     * .** *:: . ** *: : :*: : . ** :*
469 :    
470 :     fig|226900.1.peg.4136 ----------------TDYKVEVEGKDYTPQEISAIILQNLKASAEAYLG
471 :     fig|95665.5.peg.505 ----------------TKEKVTILNKEYTPEEISAKILSYIKDYAEKKLG
472 :     fig|138677.1.peg.499 ASEIQTVPYTVTSGSKGDAVFEVDGKQYTPEEIGAQILMKMKETAEAYLG
473 :     fig|243274.1.peg.368 ----------------TDYKVRIDDKEYTPQEISAFILKKLKNDAEAYLG
474 :     fig|349521.5.peg.4864 ----------------TNAAYTLGKQSFRPEELSALVLKQLKEDAEAYLN
475 :     fig|397945.5.peg.3653 ----------------SDRTVALAGRAFRPEELSSLVLRALKADAEAFLG
476 :     . : : : *:*:.: :* :* ** *.
477 :    
478 :     </pre>
479 :    
480 :     In actuality, these six sequences are part of a set of sequences that
481 :     are fairly similar, and
482 :     recognizably so. However, we believe that it is far from clear that
483 :     the alignment above is actually "correct" or "optimal" in a meaningful
484 :     sense. Rather, it is probably close to correct, but contains
485 :     errors. Exactly where the dashes (called <i>indels</i>, since they
486 :     represent characters that were either inserted or deleted) should be
487 :     placed is uncertain.
488 :     <p>
489 :     There are two classes of problems associated with multiple-sequence
490 :     alignments:
491 :     <ol>
492 :     <li>how to compute them and
493 :     <li>how to use them.
494 :     </ol>
495 :     Some of the most interesting problems are of the second sort -- using
496 :     multiple-sequence alignments in what might be called <i>molecular
497 :     archaeology</i> to uncover events in the evolutionary history of the
498 :     sequences that occur in the alignment.
499 :    
500 :     On the other hand, as we accumulate collections of thousands of
501 :     homologous sequences, one of the more important problems in
502 :     bioinformatics is the development of tools to support the
503 :     construction and use of these
504 :     multiple-sequence alignments.
505 :     <p>
506 :     Before we leave this topic, we will briefly describe a tool that we
507 :     believe any computer scientist could build easily and that would
508 :     reveal numerous research topics. Suppose that we have a single genome
509 :     that we wish to analyze, and that we have computed all regions of
510 :     similarity between sections of this genome and other complete genomes.
511 :     For each character in the genome we are focused on, we can easily
512 :     extract all regions in other genomes that are similar to regions in
513 :     the focus genome that contain the given character. Further, each of
514 :     the stored similarities (between a region in the given genome and one
515 :     of the other genomes) has an associated <i>percent identity</i> (a
516 :     measure of how similar the regions are - the percent of the aligned
517 :     characters that are identical). The utility that is needed would
518 :     let the user specify a region in the given genome, along with a
519 :     range of desired percent identities, and would then display the
520 :     alignment composed of the similarities in that range (maybe with some
521 :     representation of the consensus and how conserved the values are).
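<p>A minimal Python sketch of the selection step of such a utility follows. Everything in it is hypothetical: the <code>Similarity</code> record, its field names, and the function <code>select</code> are our own inventions for illustration, and we assume that each stored similarity keeps its two aligned strings. Percent identity is computed over the columns in which neither sequence has a dash, which is one of several reasonable conventions.</p>

```python
from dataclasses import dataclass


@dataclass
class Similarity:
    """Hypothetical record of one stored similarity."""
    start: int            # first position in the focus genome (inclusive)
    end: int              # last position in the focus genome (inclusive)
    other_genome: str     # identifier of the genome holding the similar region
    aligned_focus: str    # aligned characters from the focus genome
    aligned_other: str    # aligned characters from the other genome


def percent_identity(aligned_a, aligned_b):
    """Percent of identical characters over columns without a dash."""
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    return 100.0 * sum(x == y for x, y in columns) / len(columns)


def select(similarities, start, end, min_identity, max_identity):
    """Keep similarities that overlap [start, end] and fall in the identity range."""
    chosen = []
    for sim in similarities:
        overlaps = sim.start <= end and sim.end >= start
        identity = percent_identity(sim.aligned_focus, sim.aligned_other)
        if overlaps and min_identity <= identity <= max_identity:
            chosen.append(sim)
    return chosen


stored = [
    Similarity(1, 8, "genome_x", "ACGTACGT", "ACGTACGA"),    # 87.5% identical
    Similarity(5, 12, "genome_y", "ACGTACGT", "AGGTTCGA"),   # 62.5% identical
    Similarity(50, 57, "genome_z", "ACGTACGT", "ACGTACGT"),  # outside the region
]
for sim in select(stored, start=1, end=10, min_identity=80.0, max_identity=100.0):
    print(sim.other_genome)  # genome_x
```

<p>Displaying the selected similarities as a single alignment, with a consensus line, is the harder and more interesting part, and it is left open here.</p>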
522 : overbeek 1.1 <br>
523 : overbeek 1.6
524 : overbeek 1.5 <h3> Given a multiple sequence alignment, determine the most
525 :     likely evolutionary history of the sequences (i.e., construct a
526 :     phylogenetic tree).</h3>
527 : overbeek 1.6
528 :     Here is an example of a multiple sequence alignment:
529 :     <br>
530 :     <br>
531 :     <pre>
532 :     seq3 -------------------MRYISTRGQAPALNFEDVLLAGLASDGGLYVPENLPRFTLE
533 :     seq4 -------------------MRYISTRGSAPTLSFEEVLLTGLASDGGLYVPESLPSFTSA
534 :     seq5 -------------------MNYISTRGAIAPIGFKDAVMMGLATDGGLLLPETIPALGRN
535 :     seq1 -------------------MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLT
536 :     seq2 MKIRVICGAPTPKPFIKIPMKYYSTNKQAPLASLEEAVVKGLASDKGLFMPMTIKPLPQE
537 :    
538 :     seq3 EIASWVGLPYHELAFRVMRPFVAGSIADADFKKILEETYGVFAHDAVAPLRQLNGNEWVL
539 :     seq4 ELEAMASLDYPSLAHRILLPFVEEAFTGEELREIIDDTYAVFRHSAVAPLVQLDHNQWVL
540 :     seq5 TLESWQSLSYQDLAFNVIS-LFADDIPAQDLKDLIDRSYATFSHPEITPVVEKDG-VYIL
541 :     seq1 EIDEMLKLDFVTRSAKILSAFIGDEIPQEILEERVRAAFAFP-----APVANVESDVGCL
542 :     seq2 FYDEIENLSFREIAYRVADAFFGEDVPAETLKEIVYDTLNFD-----VPLVPVKENIYSL
543 :    
544 :     seq3 ELFHGPTLAFKDFALQLLGRLLDHVLAKRGER-VVIMGATSGDTGSAAIEGCRRCDNVDI
545 :     seq4 ELFHGPTLAFKDFALQLLGRLLDAILKRRGEK-VVIMGATSGDTGSAAIAGCERCENIDI
546 :     seq5 ELFHGPTLAFKDVALQLLGNLFEYLLKERGEK-MNIVGATSGDTGSAAIYGVRGKDKINI
547 :     seq1 ELFHGPTLAFKDFGGRFMAQMLTHIA---GDKPVTILTATSGDTGAAVAHAFYGLPNVKV
548 :     seq2 ELFHGPTLAFKDVGGRFMARLLGYFIRKEGRKQVNVLVATSGDTGSAVANGFLGVEGIHV
549 :    
550 :     seq3 FIMHPHNRVSEVQRRQMTTILGDNIHNIAIEGNFDDCQEMVKASFADQGFLK-GTRLVAV
551 :     seq4 FILHPHGRVSEVQRRQMTTLSAPTIHNLAIEGNFDDCQAMVKASFRDQSFLPDGRRLVAV
552 :     seq5 FILHPHGKTSPVQALQMTTVLDPNVHNIAARGTFDDCQNIVKSLFSDLPFKE-KYSLGAV
553 :     seq1 VILYPRGKISPLQEKLFCTLGG-NIETVAIDGDFDACQALVKQAFDDEELKV-ALGLNSA
554 :     seq2 YVLYPKGKVSEIQEKQFTTLGR-NITALEVDGTFDDCQALVKAAFMDQELNE-QLLLTSA
555 :    
556 :     seq3 NSINWARIMAQIVYYFHAALQLG-APH-RSVAFSVPTGNFGDIFAGYLARNMGLPVSQLI
557 :     seq4 NSINWARIMAQIVYYFYAGLRLG-APH-RAAAYSVPTGNFGDIFAGYLASKMGLPVAQLM
558 :     seq5 NSINWARVLAQVVYYFYAYFRVA-ALFGQEVVFSVPTGNFGDIFAGYVAKRMGLPIRRLI
559 :     seq1 NSINISRLLAQICYYFEAVAQLPQETRNQ-LVVSVPSGNFGDLTAGLLAKSLGLPVKRFI
560 :     seq2 NSINVARFLPQAFYYFYAYAQLKKAGRAENVVICVPSGNFGNITAGLFGKKMGLPVRRFI
561 :    
562 :     seq3 VATNRNDILHRFMSGNRYDKDTLHPSLSPSMDIMVSSNFERLLFDLHGRNGKAVAELLDA
563 :     seq4 IATNRNDVLHRLLSTGDYARQTLEHTLSPSMDISVSSNFERLMFDLYERDGAAIASLMAA
564 :     seq5 LATNENNILSRFINGGDYSLGDVVATVSPSMDIQLASNFERYVYYLFGENPARVREAFAA
565 :     seq1 AATNVNDTVPRFLHDGQWSPKATQATLSNAMDVSQPNNWPR-VEELFR------------
566 :     seq2 AANNKNDIFYQYLQTGQYNPRPSVATIANAMDVGDPSNFAR-VLDLYGGS----------
567 :    
568 :     seq3 FKASGKLSVEDQRWTEARKLFDSLAVSDEQTCETIAEVYRSCGELLDPHTAIGVRAAREC
569 :     seq4 FDD-GDITLSDAAMEKARQLFASHRVDDAQTLACIADVWGRTEYLLDPHSAIGYAAATQP
570 :     seq5 LPTKGRIDFTEAEMEKVRDEFLSRSVNEDETIATIAAFHRETGYILDPHTAVGVKAALEL
571 :     seq1 -------------RKIWQLKELGYAAVDDETTQQTMRELKELGYTSEPHAAVAYRALRDQ
572 :     seq2 -------------HAAIAAEISGTTYTDEQIRESVKACWQQTGYLLDPHGACGYRALEEG
573 :    
574 :     seq3 RRSLSVPMVTLGTAHPVKFPEAVEKAGIGQAPALPAHLADLFEREERCTVLPNELAKVQA
575 :     seq4 GANTQTPWVTLATAHPAKFPDAIKASAVGTTAQLPVHLADLFERSEHFDVLPNDIAAVQR
576 :     seq5 VQDG-TPAVCLATAHPAKFAEAVVR-AVGFEPSRPTSLEGIEALPSRCDVLDADRDAIKA
577 :     seq1 LNPG-EYGLFLGTAHPAKFKESVEA-ILGETLDLPKELAERADLPLLSHNLPADFAALRK
578 :     seq2 LQPG-ETGVFLETAHPAKFLQTVES-IIGTEVEIPAKLRAFMKGEKKSLPMTKEFADFKS
579 :    
580 :     seq3 FVSQHGNRGKPL
581 :     seq4 FMSGHLGA----
582 :     seq5 FIEKKAL-----
583 :     seq1 LMMNHQ------
584 :     seq2 YLLGK-------
585 :     </pre>
586 :    
587 : overbeek 1.5 From the five extant, similar sequences displayed in the
588 :     previous alignment, we can construct
589 : overbeek 1.3 a tree that depicts the "phylogenetic history" of the sequences.
590 : overbeek 1.5 Here is one reasonable tree for these 5 sequences.
591 : overbeek 1.1 <pre>
592 : overbeek 1.6 ,----------------------- seq5
593 : overbeek 1.5 |
594 :     |
595 : overbeek 1.6 -|
596 : overbeek 1.5 |
597 : overbeek 1.1 |
598 : overbeek 1.6 | ,---------------------------- seq3
599 :     | |
600 :     | |
601 :     | ,-----------------|
602 :     | | |
603 :     | | |
604 :     | | `--------------------------- seq4
605 :     | |
606 :     | |
607 :     | ,--|
608 :     | | |
609 :     | | |
610 :     | | `---------------------------------------------- seq1
611 :     | |
612 :     | |
613 :     `--------------------|
614 :     |
615 :     |
616 :     `------------------------------------------------ seq2
617 : overbeek 1.1 </pre>
618 : overbeek 1.3 The tree suggests that at some point an ancestral
619 : overbeek 1.5 cell replicated. One copy led (through a chain of descendants) to <b>seq5</b>,
620 : overbeek 1.6 while the remaining sequences descend from the other copy.
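<p>As a rough illustration of how a tree can be derived from an alignment, here is a minimal sketch. It assumes a very simple distance (the fraction of mismatches over columns where neither sequence has a gap) and average-linkage (UPGMA-style) clustering. This is not claimed to be the method used to produce the tree above; the toy alignment and the names <code>p_distance</code> and <code>upgma</code> are hypothetical.</p>

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of mismatches over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0  # assumption: sequences with no shared columns are maximally distant
    return sum(x != y for x, y in pairs) / len(pairs)

def upgma(alignment):
    """Return a nested-tuple tree built by average-linkage clustering."""
    # Each key is a subtree (a name or a nested tuple); the value lists its leaves.
    clusters = {name: [name] for name in alignment}
    while len(clusters) > 1:
        def avg_dist(pair):
            t1, t2 = pair
            ds = [p_distance(alignment[x], alignment[y])
                  for x in clusters[t1] for y in clusters[t2]]
            return sum(ds) / len(ds)
        # Join the two clusters with the smallest average leaf-to-leaf distance.
        t1, t2 = min(combinations(clusters, 2), key=avg_dist)
        members = clusters.pop(t1) + clusters.pop(t2)
        clusters[(t1, t2)] = members
    return next(iter(clusters))

# Toy alignment (made up for illustration only).
toy = {
    "a": "ACDEFGHIKL",
    "b": "ACDEFGHIKV",
    "c": "ACDQFGHWKV",
    "d": "MCNQYGHWRV",
}
print(upgma(toy))  # ('d', ('c', ('a', 'b')))
```

<p>Real tree-building programs use far more careful models of sequence change; the sketch is only meant to make the idea "similar sequences are joined first" concrete.</p>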
621 : overbeek 1.5 <p>Note that we now have alignments that
622 :     contain thousands of sequences, and even displaying such trees is
623 :     nontrivial.
624 :     Because evolution plays such a central role in the phenomena we study,
625 :     the construction of alignments
626 :     and trees in order to compare extant versions of proteins and gain
627 :     insight into their historical origins
628 : overbeek 1.3 is considered basic to the task at hand.
629 : overbeek 1.5 </p>
630 : overbeek 1.6 <h4>What is "the tree of life" and How Might it Get Built?</h4><br>The
631 :     problem of constructing a single phylogenetic tree from a single
632 :     alignment (the last problem) is relevant to this issue, but it does not
633 :     cover it. &nbsp;Suppose that you built 200 alignments &nbsp;that
634 :     contain the sequences common to almost all genomes. &nbsp;Then, if you
635 :     were to build 200 trees, and then you found that they were not
636 :     identical (or even close in some cases), what would you infer, and how
637 :     should you respond? &nbsp;Is it even possible or desirable that we
638 :     actually create an estimate of the history of how the existing
639 :     micro-organisms have evolved from some ancestral organism?<br><br>
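<p>One way to make "not identical (or even close)" measurable is to compare the groupings that two trees imply. The sketch below is an assumption-laden illustration, in the spirit of split-based tree distances: it represents rooted trees as nested tuples and counts the clades that appear in only one of the two trees. The function names and the second tree are hypothetical; the first tree is the one drawn earlier in this section.</p>

```python
def leaves(tree):
    """Return the frozenset of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def clades(tree):
    """Return the leaf sets of all internal nodes below the root."""
    found = set()

    def visit(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            found.add(leaves(node))
        for child in node:
            visit(child, False)

    visit(tree, True)
    return found

def clade_disagreement(t1, t2):
    """Count clades that occur in exactly one of the two rooted trees."""
    return len(clades(t1) ^ clades(t2))

# The tree drawn above, and a hypothetical alternative that swaps seq1 and seq4.
tree_a = ("seq5", ((("seq3", "seq4"), "seq1"), "seq2"))
tree_b = ("seq5", ((("seq3", "seq1"), "seq4"), "seq2"))
print(clade_disagreement(tree_a, tree_a))  # 0
print(clade_disagreement(tree_a, tree_b))  # 2
```

<p>With 200 trees, a table of such pairwise counts would at least show which trees agree with one another and which are outliers.</p>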
640 :    
641 :     <h4>Assuming that We Do Have an Estimate of the Tree of Life, which Proteins Characterize Subdivisions of the Tree?</h4>It
642 :     is clear that sequences are introduced into genomes through replication
643 :     and (in addition) through horizontal transfer. &nbsp;In the presence of
644 :     large amounts of horizontal transfer, many genes will occur only in
645 :     relatively small portions of a specific subtree (these represent
646 :     relatively recent transfers). &nbsp;Is it possible and meaningful to
647 :     create inventories of proteins that tend to be unique to a subtree (or
648 :     is the concept "tend to be unique" somewhat similar to "a little
649 :     pregnant")?<br>
650 :    
651 :     <h4>Can We Identify Instances of Horizontal Transfer?</h4>How
652 :     can we construct tools to recognize horizontal transfer, and can these
653 :     tools be good enough to sort out the actual details of the evolutionary
654 :     history?<br><br>
655 :    
656 :     <h4>Can We Determine Which Columns and Sections of a Multiple-Sequence Alignment are Conserved (and Why)?</h4>Conservation
657 :     normally implies functional constraints (the reason a column has
658 :     restricted content is that any evolutionary change &nbsp;led to the
659 :     death of the organism that had it). &nbsp;Shifts of function relate to
660 :     conserved sections that have changed (i.e., the sections are not
661 :     random, but neither are they identical). &nbsp;The correspondence
662 :     between conservation and function is a rich source of significant
663 :     problems.<br><br>
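<p>A first, crude measure of column conservation can be sketched as follows. The code assumes that conservation is scored as the fraction of non-gap residues in a column that match the most common residue; this scoring rule and the function name are illustrative choices, not a standard taken from this document. The input is the final block of the alignment shown earlier.</p>

```python
from collections import Counter

def column_conservation(rows):
    """For each column, return (most common residue, its fraction of non-gap residues)."""
    result = []
    for column in zip(*rows):
        residues = [c for c in column if c != "-"]
        if not residues:
            result.append(("-", 0.0))  # column made up entirely of gaps
            continue
        residue, count = Counter(residues).most_common(1)[0]
        result.append((residue, count / len(residues)))
    return result

# Final block of the five-sequence alignment (seq3, seq4, seq5, seq1, seq2).
block = [
    "FVSQHGNRGKPL",
    "FMSGHLGA----",
    "FIEKKAL-----",
    "LMMNHQ------",
    "YLLGK-------",
]
for index, (residue, fraction) in enumerate(column_conservation(block)):
    print(index, residue, round(fraction, 2))
```

<p>Note that ignoring gaps overstates conservation in columns where only one or two sequences are present, so a serious analysis would also track how many sequences contribute to each column.</p>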
664 :    
665 :     <h4>To What Extent Can Structure (Secondary or Tertiary) be Predicted from a Multiple-Sequence Alignment?</h4>Comparison
666 :     of columns in a large multiple sequence alignment was the key to
667 :     developing secondary structures for both DNA alignments and protein
668 :     alignments.<span style="font-family: monospace;"><br></span>
669 :     <h2>The Machines: an Initial Inventory</h2>
670 :    
671 :     <h3>Energy Issues</h3>
672 :     The following diagram offers a summary of the machines that relate to
673 :     acquisition and storage of energy, as well as the production of a
674 :     number of key compounds by breaking up sugar:<br>
675 : overbeek 1.1 <br>
676 : overbeek 1.5 <br>
677 : overbeek 1.6 <img style="width: 621px; height: 612px;" alt="" src="energy.jpg"><br>
678 : overbeek 1.5 <br>
679 : overbeek 1.6 &nbsp;&nbsp; &nbsp;
680 :     <table style="text-align: left; width: 411px; height: 156px;" border="1" cellpadding="2" cellspacing="2">
681 :     <tbody>
682 :     <tr>
683 :     <td>M1</td>
684 :     <td>harvesting light energy</td>
685 :     </tr>
686 :     <tr>
687 :     <td>M2</td>
688 :     <td>building sugar from smaller components and energy</td>
689 :     </tr>
690 :     <tr>
691 :     <td>M3</td>
692 :     <td>Storing strings of sugar molecules as starch</td>
693 :     </tr>
694 :     <tr>
695 :     <td>M4</td>
696 :     <td>breaking up starch to give sugar</td>
697 :     </tr>
698 :     <tr>
699 :     <td>M5</td>
700 :     <td>breaking up sugar to get energy and smaller molecules</td>
701 :     </tr>
702 :     </tbody>
703 :     </table>
704 : overbeek 1.5 <br>
705 : overbeek 1.6 Many of our machines will need energy to run. &nbsp;In the basic
706 :     organism we are describing, we have included <span style="font-weight: bold;">M1</span> to harvest energy
707 :     from sunlight. &nbsp;This process is called <span style="font-style: italic;">photosynthesis</span>.
708 :     &nbsp;The cell stores energy in a molecule called <span style="font-weight: bold;">ATP</span>.
709 :     &nbsp;Whenever energy is needed, the molecule is broken into two
710 :     pieces, releasing energy. &nbsp;The cell maintains a fairly
711 :     constant concentration of ATP, which allows reactions throughout the
712 :     cell to depend on it. &nbsp;This is similar in many respects to the
713 :     way electricity is available throughout a house. &nbsp;Appliances
714 :     can be designed to plug in anywhere, and they assume the normal voltage
715 :     will be available. &nbsp;Similarly, we have a mechanism for
716 :     maintaining the concentration of ATP, and this allows us to include
717 :     reactions that depend on that concentration.<br>
718 :     <br>
719 :     <span style="font-weight: bold;">M2</span> is a
720 :     machine that builds sugar from CO2 and energy. &nbsp;This involves
721 :     a number of transformations. &nbsp;Eventually, we will need to
722 :     examine the individual steps, but for now let us remain at this quite
723 :     abstract level.<br>
724 :     <br>
725 :     Machines <span style="font-weight: bold;">M3</span>
726 :     and <span style="font-weight: bold;">M4 </span>&nbsp;allow
727 :     the cell to store sugars when energy is abundant, and then to use them
728 :     later when energy is needed. &nbsp;Starch should be thought of as
729 :     just a string of sugar molecules, which is a convenient way to store
730 :     them. &nbsp;When sugar is needed, <span style="font-weight: bold;">M4</span> can be used to
731 :     break off a few.<br>
732 :     <br>
733 :     Finally, <span style="font-weight: bold;">M5</span>
734 :     is a machine that takes sugar molecules and breaks them into smaller
735 :     pieces, releasing energy (in the form of ATP) in the process.
736 :     &nbsp;These smaller molecules are the building blocks that are used
737 :     &nbsp;over and over to build things needed by the
738 :     cell. &nbsp;Here is a table that contains the abbreviations we use
739 :     for these molecules. &nbsp;Frankly, if you have not had
740 :     biochemistry classes, you might simply work with the abbreviations,
741 :     since the full names can be intimidating.<br>
742 : overbeek 1.5 <br>
743 : overbeek 1.6 <table style="text-align: left;" border="1" cellpadding="2" cellspacing="2">
744 :     <tbody>
745 :     <tr>
746 :     <td>2OG</td>
747 :     <td>2-oxoglutarate</td>
748 :     </tr>
749 :     <tr>
750 :     <td>3PG</td>
751 :     <td>3-phosphoglycerate</td>
752 :     </tr>
753 :     <tr>
754 :     <td>A</td>
755 :     <td>Adenine [one of the characters in a DNA string]</td>
756 :     </tr>
757 :     <tr>
758 :     <td>Ala</td>
759 :     <td>Alanine [an amino acid]</td>
760 :     </tr>
761 :     <tr>
762 :     <td>Arg</td>
763 :     <td>Arginine [an amino acid]</td>
764 :     </tr>
765 :     <tr>
766 :     <td>Asn</td>
767 :     <td>Asparagine [an amino acid]</td>
768 :     </tr>
769 :     <tr>
770 :     <td>Asp</td>
771 :     <td>Aspartate [an amino acid]</td>
772 :     </tr>
773 :     <tr>
774 :     <td>C</td>
775 :     <td>Cytosine [one of the characters in a DNA string]</td>
776 :     </tr>
777 :     <tr>
778 :     <td>CHOR</td>
779 :     <td>Chorismate</td>
780 :     </tr>
781 :     <tr>
782 :     <td>CO2</td>
783 :     <td>Carbon dioxide</td>
784 :     </tr>
785 :     <tr>
786 :     <td>Daughter genome</td>
787 :     <td>the copy of the genome that goes to the added cell after replication</td>
788 :     </tr>
789 :     <tr>
790 :     <td>E4P</td>
791 :     <td>Erythrose 4-phosphate</td>
792 :     </tr>
793 :     <tr>
794 :     <td>Extra Membrane</td>
795 :     <td>A little extra membrane for the new cell</td>
796 :     </tr>
797 :     <tr>
798 :     <td>G</td>
799 :     <td>Guanine [one of the characters in a DNA string]</td>
800 :     </tr>
801 :     <tr>
802 :     <td>G6P</td>
803 :     <td>Glucose 6-phosphate</td>
804 :     </tr>
805 :     <tr>
806 :     <td>Genome</td>
807 :     <td>the DNA string in the cell that contains the genes</td>
808 :     </tr>
809 :     <tr>
810 :     <td>Gln</td>
811 :     <td>Glutamine [an amino acid]</td>
812 :     </tr>
813 :     <tr>
814 :     <td>Glu</td>
815 :     <td>Glutamate [an amino acid]</td>
816 :     </tr>
817 :     <tr>
818 :     <td>Gly</td>
819 :     <td>Glycine [an amino acid]</td>
820 :     </tr>
821 :     <tr>
822 :     <td>HOM</td>
823 :     <td>Homoserine</td>
824 :     </tr>
825 :     <tr>
826 :     <td>His</td>
827 :     <td>Histidine [an amino acid]</td>
828 :     </tr>
829 :     <tr>
830 :     <td>Iso</td>
831 :     <td>Isoleucine [an amino acid]</td>
832 :     </tr>
833 :     <tr>
834 :     <td>Leu</td>
835 :     <td>Leucine [an amino acid]</td>
836 :     </tr>
837 :     <tr>
838 :     <td>Lys</td>
839 :     <td>Lysine [an amino acid]</td>
840 :     </tr>
841 :     <tr>
842 :     <td>Membrane</td>
843 :     <td>the thing enclosing the cell</td>
844 :     </tr>
845 :     <tr>
846 :     <td>Met</td>
847 :     <td>Methionine [an amino acid]</td>
848 :     </tr>
849 :     <tr>
850 :     <td>OXLA</td>
851 :     <td>Oxalacetate</td>
852 :     </tr>
853 :     <tr>
854 :     <td>PEP</td>
855 :     <td>Phosphoenolpyruvate</td>
856 :     </tr>
857 :     <tr>
858 :     <td>PYR</td>
859 :     <td>Pyruvate</td>
860 :     </tr>
861 :     <tr>
862 :     <td>Phe</td>
863 :     <td>Phenylalanine [an amino acid]</td>
864 :     </tr>
865 :     <tr>
866 :     <td>Pro</td>
867 :     <td>Proline [an amino acid]</td>
868 :     </tr>
869 :     <tr>
870 :     <td>R5P</td>
871 :     <td>Ribose 5-phosphate</td>
872 :     </tr>
873 :     <tr>
874 :     <td>Ser</td>
875 :     <td>Serine [an amino acid]</td>
876 :     </tr>
877 :     <tr>
878 :     <td>Starch</td>
879 :     <td>A polymer of sugars (used for storage)</td>
880 :     </tr>
881 :     <tr>
882 :     <td>Sugar</td>
883 :     <td>think glucose</td>
884 :     </tr>
885 :     <tr>
886 :     <td>T</td>
887 :     <td>Thymine [one of the characters in a DNA string]</td>
888 :     </tr>
889 :     <tr>
890 :     <td>Thr</td>
891 :     <td>Threonine [an amino acid]</td>
892 :     </tr>
893 :     <tr>
894 :     <td>Trp</td>
895 :     <td>Tryptophan [an amino acid]</td>
896 :     </tr>
897 :     <tr>
898 :     <td>Tyr</td>
899 :     <td>Tyrosine [an amino acid]</td>
900 :     </tr>
901 :     <tr>
902 :     <td>Val</td>
903 :     <td>Valine [an amino acid]</td>
904 :     </tr>
905 :     </tbody>
906 :     </table>
907 : overbeek 1.5 <br>
908 :     <br>
909 : overbeek 1.6 <h3>Building the Amino Acids</h3>
910 :     <img style="width: 576px; height: 529px;" alt="" src="AA1.jpg"><br>
911 :     <table style="text-align: left;" border="1" cellpadding="2" cellspacing="2">
912 :     <tbody>
913 :     <tr>
914 :     <td>M6</td>
915 :     <td>build glutamate and glutamine &nbsp;from
916 :     2-oxoglutarate</td>
917 :     </tr>
918 :     <tr>
919 :     <td>M7</td>
920 :     <td>build proline from glutamate and ATP</td>
921 :     </tr>
922 :     <tr>
923 :     <td>M8</td>
924 :     <td>build aspartate from 2-oxalacetate</td>
925 :     </tr>
926 :     <tr>
927 :     <td>M9</td>
928 :     <td>build arginine from glutamate, aspartate, and ATP</td>
929 :     </tr>
930 :     <tr>
931 :     <td>M10</td>
932 :     <td>build asparagine from glutamine, aspartate, and ATP</td>
933 :     </tr>
934 :     <tr>
935 :     <td>M11</td>
936 :     <td>build serine from 3-phosphoglycerate and glutamate</td>
937 :     </tr>
938 :     </tbody>
939 :     </table>
940 : overbeek 1.5 <br>
941 : overbeek 1.6 <img style="width: 512px; height: 665px;" alt="" src="AA2.jpg"><br>
942 : overbeek 1.5 <br>
943 : overbeek 1.6 <table style="text-align: left;" border="1" cellpadding="2" cellspacing="2">
944 :     <tbody>
945 :     <tr>
946 :     <td align="undefined" valign="undefined">M12</td>
947 :     <td align="undefined" valign="undefined">build
948 :     glycine from serine</td>
949 :     </tr>
950 :     <tr>
951 :     <td align="undefined" valign="undefined">M13</td>
952 :     <td align="undefined" valign="undefined">build
953 :     cysteine from serine</td>
954 :     </tr>
955 :     <tr>
956 :     <td align="undefined" valign="undefined">M14</td>
957 :     <td align="undefined" valign="undefined">build
958 :     methionine from homoserine and cysteine</td>
959 :     </tr>
960 :     <tr>
961 :     <td align="undefined" valign="undefined">M15</td>
962 :     <td align="undefined" valign="undefined">build lysine from pyruvate and aspartate</td>
963 :     </tr>
964 :     <tr>
965 :     <td align="undefined" valign="undefined">M16</td>
966 :     <td align="undefined" valign="undefined">build
967 :     homoserine from aspartate</td>
968 :     </tr>
969 :     <tr>
970 :     <td align="undefined" valign="undefined">M17</td>
971 :     <td align="undefined" valign="undefined">build threonine from homoserine and ATP</td>
972 :     </tr>
973 :     <tr>
974 :     <td align="undefined" valign="undefined">M18</td>
975 :     <td align="undefined" valign="undefined">build isoleucine from glutamate, threonine and pyruvate</td>
976 :     </tr>
977 :     </tbody>
978 :     </table>
979 : overbeek 1.5 <br>
980 : overbeek 1.6 <img style="width: 563px; height: 651px;" alt="" src="AA3.jpg"><br>
981 : overbeek 1.5 <br>
982 : overbeek 1.6 <table style="text-align: left;" border="1" cellpadding="2" cellspacing="2">
983 :     <tbody>
984 :     <tr>
985 :     <td align="undefined" valign="undefined">M19</td>
986 :     <td align="undefined" valign="undefined">build alanine from pyruvate</td>
987 :     </tr>
988 :     <tr>
989 :     <td align="undefined" valign="undefined">M20</td>
990 :     <td align="undefined" valign="undefined">build valine from pyruvate</td>
991 :     </tr>
992 :     <tr>
993 :     <td align="undefined" valign="undefined">M21</td>
994 :     <td align="undefined" valign="undefined">build leucine from pyruvate</td>
995 :     </tr>
996 :     <tr>
997 :     <td align="undefined" valign="undefined">M22</td>
998 :     <td align="undefined" valign="undefined">build the intermediate &nbsp;chorismate from phosphoenolpyruvate and erythrose 4-phosphate</td>
999 :     </tr>
1000 :     <tr>
1001 :     <td align="undefined" valign="undefined">M23</td>
1002 :     <td align="undefined" valign="undefined">build tyrosine and phenylalanine from glutamate and chorismate</td>
1003 :     </tr>
1004 :     <tr>
1005 :     <td align="undefined" valign="undefined">M24</td>
1006 :     <td align="undefined" valign="undefined">build tryptophan from chorismate and glutamine</td>
1007 :     </tr>
1008 :     <tr>
1009 :     <td align="undefined" valign="undefined">M25</td>
1010 :     <td align="undefined" valign="undefined">build ribose 5-phosphate from glucose-6-phosphate</td>
1011 :     </tr>
1012 :     <tr>
1013 :     <td align="undefined" valign="undefined">M26</td>
1014 :     <td align="undefined" valign="undefined">build histidine from ribose-5-phosphate and ATP</td>
1015 :     </tr>
1016 :     </tbody>
1017 :     </table>
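<p>The tables above can be read as a small data structure: each machine maps a set of input compounds to a set of output compounds. The following sketch encodes a few of the machines listed above and computes which compounds become available from a given starting set. It is an abstraction only, under stated assumptions: quantities, reaction rates, and the consumption of inputs are ignored, and the names <code>MACHINES</code> and <code>producible</code> are hypothetical.</p>

```python
# A subset of the machines from the tables above: name -> (inputs, outputs).
MACHINES = {
    "M6": ({"2OG"}, {"Glu", "Gln"}),
    "M7": ({"Glu", "ATP"}, {"Pro"}),
    "M8": ({"OXLA"}, {"Asp"}),
    "M9": ({"Glu", "Asp", "ATP"}, {"Arg"}),
    "M10": ({"Gln", "Asp", "ATP"}, {"Asn"}),
    "M16": ({"Asp"}, {"HOM"}),
    "M17": ({"HOM", "ATP"}, {"Thr"}),
}

def producible(start, machines):
    """Return every compound reachable from `start` by repeatedly running machines."""
    have = set(start)
    changed = True
    while changed:
        changed = False
        for inputs, outputs in machines.values():
            # A machine can run once all of its inputs are available.
            if inputs <= have and not outputs <= have:
                have |= outputs
                changed = True
    return have

print(sorted(producible({"2OG", "OXLA", "ATP"}, MACHINES)))
print(sorted(producible({"2OG", "ATP"}, MACHINES)))
```

<p>Removing a compound from the starting set (as in the second call, which omits OXLA) shows which products depend on it.</p>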
1018 : overbeek 1.5 <br>
1019 : overbeek 1.6 <h3>Expressing Genes</h3>
1020 :     <img style="width: 374px; height: 430px;" alt="" src="./expression.jpg"><br>
1021 : overbeek 1.5 <br>
1022 : overbeek 1.6 <table style="text-align: left;" border="1" cellpadding="2" cellspacing="2">
1023 :     <tbody>
1024 :     <tr>
1025 :     <td align="undefined" valign="undefined">M30</td>
1026 :     <td align="undefined" valign="undefined">building
1027 :     a protein from amino acids and a gene</td>
1028 :     </tr>
1029 :     </tbody>
1030 :     </table>
1031 : overbeek 1.5 <br>
1032 : overbeek 1.6 <b>M30</b> is a complex machine that we have not represented all that
1033 :     well. It exists in the cell, and you might imagine the cell as
1034 :     containing free-floating amino acids (which are built by the machines
1035 :     discussed above). <b>M30</b> can take the description of a protein
1036 :     encoded in a gene and build the protein from the instructions and the
1037 :     free-floating amino acids. It is certainly a complex and incredible
1038 :     machine, and it exists as a central component of the life forms we are
1039 :     studying.
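<p>At the level of abstraction used here, the core of what <b>M30</b> does can be sketched as reading a gene three characters at a time and looking up each codon. The sketch below assumes the standard genetic code, a gene given as a DNA string that starts at its first codon, and translation that stops at the first stop codon; the real machinery involves intermediate steps that are omitted. The names <code>CODON_TABLE</code> and <code>translate</code> are hypothetical.</p>

```python
# Standard genetic code, listed with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def translate(gene):
    """Translate a DNA string codon by codon, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(gene) - 2, 3):
        amino_acid = CODON_TABLE[gene[pos:pos + 3]]
        if amino_acid == "*":  # "*" marks a stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("ATGTGGAAATGA"))  # MWK
```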
1040 :    
1041 :     <h3>Motility</h3>The cell we envision has some motility. &nbsp;It can
1042 :     "turn on its motor and propellers" to move a bit, turn off the motility
1043 :     machinery, wait a while, turn it on again, and so forth.<br>We do not show a diagram or table of this machine, but we shall number it <span style="font-weight: bold;">M31</span>.<h3>Replication</h3>
1044 :     <br>
1045 :     Replication is described here in a somewhat imprecise manner. &nbsp;We think of <span style="font-weight: bold;">M27</span> as a machine that builds the <span style="font-weight: bold;">nucleotides</span>, which are the characters that make up the DNA genome. &nbsp; Then <span style="font-weight: bold;">M28</span>
1046 :     is a machine that takes these loose "characters" floating in the cell,
1047 :     along with the existing genome, and manufactures a copy of the genome.
1048 :     &nbsp; Finally, <span style="font-weight: bold;">M29</span> takes some extra membrane (see the output of <span style="font-weight: bold;">M5</span>)
1049 :     and the genome copy, and "pinches" the extended cell, creating two separate
1050 :     cells, which we call the "original" (containing the original genome) and
1051 :     the "daughter" (containing the copy of the genome).<br>
1052 :     <h2><img style="width: 503px; height: 674px;" alt="" src="./replication.jpg">&nbsp;</h2>
1053 : overbeek 1.5 <br>
1054 :     <br>
1055 : overbeek 1.6 <table style="text-align: left;" border="1" cellpadding="2" cellspacing="2">
1056 :     <tbody>
1057 :     <tr>
1058 :     <td align="undefined" valign="undefined">M27</td>
1059 :     <td align="undefined" valign="undefined">build
1060 :     nucleotides</td>
1061 :     </tr>
1062 :     <tr>
1063 :     <td align="undefined" valign="undefined">M28</td>
1064 :     <td align="undefined" valign="undefined">build
1065 :     new genome</td>
1066 :     </tr>
1067 :     <tr>
1068 :     <td align="undefined" valign="undefined">M29</td>
1069 :     <td align="undefined" valign="undefined">split
1070 :     the cell into original and daughter</td>
1071 :     </tr>
1072 :     </tbody>
1073 :     </table><br><h2>Problems in BioInformatics that Can Be Done Once the Notion of "Function" Exists</h2><br>The
1074 :     inventory of machines has led us (albeit circuitously) into a
1075 :     discussion of "the function of a protein" and how to think about it.
1076 :     &nbsp;These problems relate to the use of comparative analysis between
1077 :     the protein sequences from many distinct genomes (and what clues we can
1078 :     expect to develop in our attempts to make sense of it all).<br><br>
1079 :    
1080 :     <h4>Identifying the Functions of Genes</h4>The
1081 :     general topic of how to assign function to genes is central to genome
1082 :     annotation. &nbsp;Deciding when you can safely project function based
1083 :     on similarity is a topic that can profitably be pondered.
1084 :     <p>
1085 :     Before leaving this topic, it is worth noting that a site called
1086 :     <a href=http://clearinghouse.nmpdr.org/aclh.cgi>The Annotation Clearinghouse</a>
1087 :     exists. This resource will allow users to download assertions of
1088 :     function that are considered to be reasonably reliable by human
1089 :     annotators manually curating the growing body of data. The assertions
1090 :     use widely differing IDs for genes (but a table for interconverting
1091 :     the IDs is provided), they use an uncontrolled vocabulary (although
1092 :     progress is being made in developing synonym lists), and many of the
1093 :     assertions are undoubtedly wrong. However, it is a start on a
1094 :     resource of central importance.
1095 :     <br><br>
1096 :    
1097 :     <h4>Predicting When Two Genes Implement Related Functions</h4>There
1098 :     are many clues that can be used to improve the accuracy of function
1099 :     projection. &nbsp;Conservation of contiguity, detection of gene
1100 :     fusions, protein-protein interaction data, and characterization of
1101 :     regulatory sites have all proven useful. &nbsp;Integration of clues from
1102 :     a number of sources has been attempted (and will undoubtedly be
1103 :     important in the future).<br>
1104 :     <p>
1105 :     In our view, the most useful set of clues to date has arisen from
1106 :     recognizing that genes that implement closely related functions (i.e.,
1107 :     functions that are part of the same machine or machines that implement
1108 :     connected functions) often occur close to one another in the genome.
1109 :     That is, if you take the genes that implement a machine, and you look
1110 :     at where these genes occur in the genome, the occurrences are not
1111 :     random. On average, about 50% of the genes that make up a machine will occur within
1112 :     5000 characters of one another in the genome. In some genomes far
1113 :     fewer genes cluster (for reasons we do not fully understand).
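<p>The clustering tendency described above can be checked with very little code. The sketch below assumes each gene is represented only by its start position on a linear genome and reports every pair of genes whose starts lie within 5000 characters of one another. The gene identifiers, positions, and the function name <code>close_pairs</code> are made up for illustration.</p>

```python
def close_pairs(genes, max_gap=5000):
    """Return pairs of gene ids whose start positions lie within max_gap characters."""
    ordered = sorted(genes.items(), key=lambda item: item[1])
    pairs = []
    for i, (gene_a, start_a) in enumerate(ordered):
        for gene_b, start_b in ordered[i + 1:]:
            if start_b - start_a > max_gap:
                break  # genes are sorted, so all later genes are even farther away
            pairs.append((gene_a, gene_b))
    return pairs

# Hypothetical start positions for five genes.
genes = {"g1": 1000, "g2": 3500, "g3": 5800, "g4": 40000, "g5": 43000}
print(close_pairs(genes))
# [('g1', 'g2'), ('g1', 'g3'), ('g2', 'g3'), ('g4', 'g5')]
```

<p>Given the genes believed to make up one machine, the fraction that appear in at least one such pair gives a simple estimate of how strongly that machine clusters in a particular genome.</p>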
1114 :     <p>
1115 :     To exploit this tendency, we might construct sets of pairs of genes.
1116 :     All pairs in a set occur close together in a genome (one of the ones
1117 :     in our collection). All of the first members of pairs are similar to
1118 :     one another, and all of the second members are similar to one another.
1119 :     The fact that all of the 2-tuples in each set have corresponding pairs
1120 :     that are similar might lead one to believe that all of the pairs
1121 :     implemented the same two abstract functions, but that is not the
1122 :     case. It is often, and perhaps usually, the case; but, there are many
1123 :     instances where the pairs implement distinct functions. For example,
1124 :     there are many cases in which 4 close genes implement a transport
1125 :     machine. For each of these transport machines, even though they
1126 :     transport completely different compounds, 3 of the 4 genes are pretty
1127 :     similar. The fourth gene is often the one that is specific to the
1128 :     compound being transported.
1129 :     <p>
1130 :     What we can say, assuming that we find enough entries in a set (that
1131 :     is way more coresponding pairs than one would expect by random), is
1132 :     that the functions of the genes in each pair are related. We cannot
1133 :     say with reliability that the actual functions in all of the pairs
1134 :     match up, but the ones in each pair will usually be related.
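<p>The construction of sets of corresponding pairs can be sketched as follows. The code assumes that similar genes have already been grouped into families (the family labels stand in for "similar to one another"), and it counts, for each pair of families, the number of genomes in which a member of each lies within 5000 characters of the other. The data, the names, and the simple <code>min_genomes</code> cutoff are hypothetical; a real analysis would compare the counts against what chance alone would produce.</p>

```python
from collections import Counter
from itertools import combinations

def conserved_family_pairs(genomes, max_gap=5000, min_genomes=2):
    """Count, per pair of families, the genomes in which members lie close together."""
    counts = Counter()
    for genes in genomes.values():
        seen = set()  # count each family pair at most once per genome
        for (fam_a, pos_a), (fam_b, pos_b) in combinations(genes, 2):
            if fam_a != fam_b and abs(pos_a - pos_b) <= max_gap:
                seen.add(tuple(sorted((fam_a, fam_b))))
        counts.update(seen)
    return {pair: n for pair, n in counts.items() if n >= min_genomes}

# Hypothetical genomes: each gene is (family label, start position).
genomes = {
    "genomeA": [("famX", 1000), ("famY", 2500), ("famZ", 90000)],
    "genomeB": [("famY", 700), ("famX", 4000), ("famZ", 6000)],
    "genomeC": [("famX", 10), ("famZ", 50000), ("famY", 3000)],
}
print(conserved_family_pairs(genomes))  # {('famX', 'famY'): 3}
```

<p>As the text cautions, a high count suggests that the two families have related functions; it does not show that every pair implements the same two functions.</p>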
1135 :     <p>
1136 :     Further, a single protein might well participate in pairs from several
1137 :     sets. By combining the evidence from all of these sets of pairs, it
1138 :     is possible to produce an estimate of all of the components in a
1139 :     machine, without really knowing the functions of any of them. That
1140 :     is, it becomes possible to say "I think that these four genes
1141 :     implement a machine", and to do so without having a clear idea of what
1142 :     the machine actually does.
1143 :     The information produced by examining conserved contiguity has not
1144 :     really been completely exploited. It has proved to be immensely
1145 :     useful, but there is far more to be gleaned from this data by those
1146 :     with some minimal creativity and statistical competence.
1147 :     <p>
1148 :    
1149 :     <h4>Grouping Genes into Subsystems</h4>The genes that encode proteins that together implement a single machine may be thought of as an instance of a <span style="font-style: italic;">subsystem</span>.
1150 :     &nbsp;In later tutorials we will discuss the notion of subsystem in
1151 :     more detail. &nbsp;Essentially, it is an abstraction of the notion of
1152 :     machine, and it represents an important conceptual framework for
1153 :     analyzing the functions of genes from many genomes simultaneously.
1154 :     &nbsp;So, how can you detect when two genes are components of the same
1155 :     machine?<br><br>
1156 :    
1157 :     <h4>Constructing Sets of Isofunctional Homologs</h4>Homologs
1158 :     are genes that share a common ancestor. &nbsp;Isofunctional genes
1159 :     implement the same function. &nbsp;The goal of compiling sets of
1160 :     homologous genes (and the proteins they encode) that implement a single
1161 :     function is central to automating annotation of genomes. &nbsp;Further,
1162 :     since we will be faced with annotating thousands of new genomes over
1163 :     the next few years (and the number will grow much more rapidly after that),
1164 :     almost all annotations will be automated.<br><br>
1165 :    
1166 :     <h4>Supporting Decision Procedures for Sets of Isofunctional Homologs</h4>Suppose
1167 :     that you have a collection of sets of isofunctional homologs.
1168 :     &nbsp;Suppose further that you have, say, 10,000 of these sets.
1169 :     &nbsp;For each set, you will wish to develop a decision procedure
1170 :     which, when given&nbsp;as input a set and a new protein sequence,
1171 :     determines whether or not the protein should be added to the set.
1172 :     &nbsp;In some cases, such decisions are easy, and you will wish to use
1173 :     a very fast decision procedure. &nbsp;In others, they are very
1174 :     difficult, and you will need to bring many sources of clues to
1175 :     bear.<br>Construction of such decision procedures will become
1176 :     increasingly important.<br><br>
1177 : overbeek 1.2
1178 : overbeek 1.6 <h4>Characterization of Regulons for a Genome</h4>Genes
1179 :     are often co-regulated. &nbsp;That is, expression of a set of genes may
1180 :     always be tightly coordinated. &nbsp;In this case, we will think of the
1181 :     co-regulated set as a <span style="font-weight: bold;">regulon</span>.
1182 :     &nbsp;Determination of which genes make up which regulons is a task
1183 :     that requires both bioinformatic analysis and wet lab confirmation.
1184 :     &nbsp;Don't attempt this one without a close working relationship with
1185 :     a wet lab biologist.<br><br><h4>Characterization of "States of the Cell"</h4>It might be conjectured that a cell has a limited set of <span style="font-weight: bold;">states</span>.
1186 :     &nbsp;Each state is characterized by the set of regulons that are
1187 :     expressed. &nbsp;It seems likely that the cell should be viewed as
1188 :     "tending to stay in the same state" until forced to make a transition
1189 :     to another state. &nbsp;That is, the states demonstrate a degree of <span style="font-style: italic;">homeostasis.</span>
1190 :     &nbsp;If we had a comprehensive list of states, and we worked out
1191 :     the forces that determine transitions, we would begin to understand the
1192 :     cell as a dynamic system.<br>
1193 : overbeek 1.1
1194 : overbeek 1.6 </body></html>
