
Annotation of /FigTutorial/tut_abs.html



Revision 1.4

1 : overbeek 1.1 <div align=center>
2 :     <h1>The Role of Bioinformatics in Interpreting Genomes of Unicellular Organisms:</h1>
3 :     <br>
4 :     <h1>An Abstract View</h1>
5 :     <h2>by Ross Overbeek</h2>
6 :     </div>
7 : overbeek 1.4 <h2>Introduction</h2>
8 :     This strange document began as a tutorial for computer scientists and mathematicians. It was supposed
9 :     to somehow introduce them to the computational issues in genome analysis.
10 :     It was requested by an instructor in a computer class. In attempting to respond to this request, Overbeek
11 :     formulated an abstraction that he began to believe had significance beyond the tutorial.
12 :     <p>
13 :     This document is a set of working notes relating to the abstraction. It is not organized properly as
14 :     an abstraction, a tutorial, or an essay on the role of bioinformatics in support of biological research. It is,
15 :     however, organized properly as a working document that relates to all of these goals.
16 :     <p>
17 :     It begins with a development of the abstraction. This will be suitable for mathematicians or computer scientists.
18 :     The abstraction is developed in three steps: the basic abstraction, the enhanced abstraction needed to provide
19 :     basic bioinformatics support for biologists, and finally a third step that includes support for the notion
20 :     of regulation. The intent throughout this discussion will be to seek a minimal set of concepts needed to
21 :     effectively capture the essence of the required data. Unlike almost all efforts to lay a foundation
22 :     for tutorials, software or research in biology, this effort focuses on leaving out as much as possible.
23 :     While we do believe that there is an almost unlimited complexity that can be introduced, and almost all of
24 :     it is needed for some specific goals, the vast majority of tools and discussions require (we believe) relatively few
25 :     concepts. As they say, "the proof is in the pudding."
26 :    
27 :     <p>
28 :     The second section will feature more tutorial-style comments. It may well repeat much of what is in Part 1.
29 :     This part is offered as a way of easing a computer scientist or mathematician into the issues that need to be
30 :     considered, if they wish to try to do useful research relating to the genomics revolution. Eventually, this part
31 :     will be dramatically expanded by giving condensed summaries of the machines of the cell broken into two broad
32 :     sets: the metabolic network and the cellular machinery not directly included in the metabolic network. Loosely,
33 :     this separates what would be learned in a microbial biochemistry class (when they exist) from what would
34 :     be learned in a course on molecular biology.
35 :     <p>
36 :     The third part is an essay that attempts to characterize our view on
37 :     <ul>
38 :     <li> what the main goals should be in current efforts to advance biological knowledge via genome research,
39 :     <li> what role bioinformatics researchers have played in the past, and
40 :     <li> what role they could productively play during the coming few years.
41 :     </ul>
42 :     As such, it is undoubtedly an arrogant formulation by a group of individuals with minimal background in
43 :     biology.
44 :     <p>
45 :     The fourth section will focus on the implications of the abstractions for software development.
46 :     This is a bit of a radical proposal that makes sense to us (and is in an area that we can
47 :     legitimately claim expertise).
48 : overbeek 1.1
49 : overbeek 1.4 <h1>Part 1: The Abstractions</h1>
50 :     <h2>The cell: a Minimal Perspective</h2>
51 : overbeek 1.1
52 :     A <b>cell</b> is a bag (i.e., a volume enclosed by a membrane) that contains three types of things: compounds, cellular machines, and a genome.
53 :     <p>
54 : overbeek 1.3 By the term <b>compound</b> we refer to the normal notion of chemical compound.
55 : overbeek 1.1 <p>
56 :    
57 : overbeek 1.3 A <b>cellular machine</b> is a set of proteins that together perform a function. Unless otherwise noted,
58 :     when we use the term <i>machine</i> we will always be speaking of a cellular machine.
59 :     Many machines
60 :     transform one set of compounds into another set. Some machines (transport machines)
61 : overbeek 1.1 are used to move compounds into
62 : overbeek 1.3 or out of the cell. Later we will try to convey a more comprehensive notion of what functions are implemented
63 :     by machines that we understand.
64 : overbeek 1.1 <p>
65 :    
66 :     A <b>protein</b> is a string of amino acids (i.e., a string in the 20-character alphabet {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}).
67 :     <p>
68 :    
69 :     A <b>genome</b> is a string of DNA bases (i.e., a string in the 4-character alphabet {A,C,G,T}).
70 :     <p>
71 :    
72 :     A <b>gene</b> is a region in the genome that describes how to build a
73 :     protein. The description is a sequence of 3-character codons. Each
74 :     codon corresponds to either a single amino acid or a stop codon.
75 :     There are three stop codons: {TAA,TAG,TGA}. The genetic code is the
76 :     table of correspondences between codons and amino acids:
77 :     <br><br>
78 :     <table border>
79 :     <tr><th>Amino Acid</th><th>Codons</th></tr>
80 :     <tr><td>A</td> <td>GCT, GCC, GCA, GCG </td></tr>
81 :     <tr><td>C</td> <td>TGT, TGC</td></tr>
82 :     <tr><td>D</td> <td>GAT, GAC</td></tr>
83 :     <tr><td>E</td> <td>GAA, GAG</td></tr>
84 :     <tr><td>F</td> <td>TTT, TTC</td></tr>
85 :     <tr><td>G</td> <td>GGT, GGC, GGA, GGG</td></tr>
86 :     <tr><td>H</td> <td>CAT, CAC</td></tr>
87 :     <tr><td>I</td> <td>ATT, ATC, ATA</td></tr>
88 :     <tr><td>K</td> <td>AAA, AAG</td></tr>
89 :     <tr><td>L</td> <td>TTA, TTG, CTT, CTC, CTA, CTG</td></tr>
90 :     <tr><td>M</td> <td>ATG</td></tr>
91 :     <tr><td>N</td> <td>AAT, AAC</td></tr>
92 :     <tr><td>P</td> <td>CCT, CCC, CCA, CCG</td></tr>
93 :     <tr><td>Q</td> <td>CAA, CAG</td></tr>
94 :     <tr><td>R</td> <td>CGT, CGC, CGA, CGG, AGA, AGG</td></tr>
95 :     <tr><td>S</td> <td>TCT, TCC, TCA, TCG, AGT, AGC</td></tr>
96 :     <tr><td>T</td> <td>ACT, ACC, ACA, ACG</td></tr>
97 :     <tr><td>V</td> <td>GTT, GTC, GTA, GTG</td></tr>
98 :     <tr><td>W</td> <td>TGG</td></tr>
99 :     <tr><td>Y</td> <td>TAT, TAC</td></tr>
100 :     <tr><td>*</td> <td>TAG, TGA, TAA [Stop codons]</td></tr>
101 :     </table>
102 :     <br><br>
103 :     <hr>
104 : overbeek 1.4 The process of building a protein as a string of amino acids from the gene containing codons is
105 :     called <b>expressing</b> the gene.
106 :     <br>
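<p>
The expression step can be sketched in a few lines of Python. This is only an illustration of the basic abstraction: the function name <i>express</i> is an assumption of this sketch, the genetic code is copied from the table above, and the example gene is the first nine codons of <i>gene1</i> (shown later in this document) followed by a stop codon.

```python
# Genetic code from the table above, keyed by amino acid; "*" marks the stop codons.
CODONS_BY_AMINO_ACID = {
    "A": "GCT GCC GCA GCG", "C": "TGT TGC", "D": "GAT GAC", "E": "GAA GAG",
    "F": "TTT TTC", "G": "GGT GGC GGA GGG", "H": "CAT CAC", "I": "ATT ATC ATA",
    "K": "AAA AAG", "L": "TTA TTG CTT CTC CTA CTG", "M": "ATG", "N": "AAT AAC",
    "P": "CCT CCC CCA CCG", "Q": "CAA CAG", "R": "CGT CGC CGA CGG AGA AGG",
    "S": "TCT TCC TCA TCG AGT AGC", "T": "ACT ACC ACA ACG",
    "V": "GTT GTC GTA GTG", "W": "TGG", "Y": "TAT TAC", "*": "TAG TGA TAA",
}

# Invert the table so that each codon maps to its amino acid (or "*").
CODON_TO_AMINO_ACID = {
    codon: amino_acid
    for amino_acid, codons in CODONS_BY_AMINO_ACID.items()
    for codon in codons.split()
}


def express(gene: str) -> str:
    """Translate a gene codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(gene) - 2, 3):
        amino_acid = CODON_TO_AMINO_ACID[gene[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(express("ATGGCTGATTTATTCGCATTGACCGAATGA"))  # prints MADLFALTE
```

The sketch treats a gene as an uninterrupted run of codons on a single string, which is exactly the simplification the basic abstraction makes.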
107 :     A <b>subsystem</b> (i.e., an abstract cellular machine) is a set of functional roles.
108 :     Each protein implements one or more functional roles. The set of functional roles
109 :     implemented by the protein is called the <b>function of the protein</b>. The function of a multifunctional
110 :     protein that implements {functional-role-1,functional-role-2} is normally written as
111 :     <i>functional-role-1 / functional-role-2</i>.
112 :     <br><br>
113 :     A <b>populated subsystem</b> is a subsystem with an attached spreadsheet. Each column
114 :     in the spreadsheet corresponds to a functional role in the subsystem, and each row corresponds to
115 :     a specific genome. Each cell in the spreadsheet contains the genes from the corresponding genome
116 :     that implement the designated functional role (there may be 0 or more such genes).
117 :     <br><br>
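<p>
One possible way to hold a populated subsystem in memory is sketched below. The class name, field names, and the example identifiers are all hypothetical; the only things taken from the text are that columns are functional roles, rows are genomes, and each cell holds zero or more genes.

```python
from dataclasses import dataclass, field


@dataclass
class PopulatedSubsystem:
    """A subsystem (a set of functional roles) plus its spreadsheet."""
    name: str
    roles: list[str]  # the columns of the spreadsheet
    # rows[genome_id][role] is the list of genes in that spreadsheet cell
    rows: dict[str, dict[str, list[str]]] = field(default_factory=dict)

    def add_gene(self, genome_id: str, role: str, gene_id: str) -> None:
        """Record that a gene from a genome implements a functional role."""
        if role not in self.roles:
            raise ValueError(f"{role!r} is not a role of subsystem {self.name!r}")
        row = self.rows.setdefault(genome_id, {r: [] for r in self.roles})
        row[role].append(gene_id)

    def cell(self, genome_id: str, role: str) -> list[str]:
        """Return the genes in one spreadsheet cell (possibly an empty list)."""
        return self.rows.get(genome_id, {}).get(role, [])


# Hypothetical example data, for illustration only.
subsystem = PopulatedSubsystem("example-subsystem", ["role-1", "role-2"])
subsystem.add_gene("genome-A", "role-1", "gene-17")
print(subsystem.cell("genome-A", "role-1"))  # ['gene-17']
print(subsystem.cell("genome-A", "role-2"))  # []
```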
118 :     We do not actually know what machines are present in a cell. We are in the midst of a grand
119 :     effort to clarify which are there and what they do. The formulation of subsystems as abstract machines
120 :     in which each row of the subsystem describes a specific cellular machine that is believed to be present,
121 :     represents a way to maintain a collection of estimates or assertions.
122 :     <p>
123 :     A <b>protein family</b> is defined to be a set of proteins that implement the same functional roles and
124 :     are similar over the entire lengths of the proteins.
125 :     <p>
126 :     We seek a situation in which each protein occurs in one or more subsystems and in a single protein family.
127 :     <p>
128 :     In any specific cell, sets of specific cellular machines are
129 :     switched on and off as units. That is, they are <i>co-regulated</i>. We will call such a set
130 :     of <i>co-regulated cellular machines</i> a <b>regulon</b> (note that a regulon is often a set containing
131 :     a single cellular machine). A <b>state</b> of a cell will be defined
132 :     as the set of regulons that are operational at a point in time. Thus, a state amounts to the set
133 :     of cellular machines that are operational at one instant.
134 :     <p>
135 :     A microarray experiment yields, for a given genome, two lists of genes that "changed expression levels" between two states of a
136 :     cell. Basically, the first list contains genes that were "active" during the first state, but not the second; and the
137 :     second list contains genes that were "active" in the second but not the first. If a cellular
138 :     machine utilizes protein <i>X</i>, and <i>X</i> is in the first list, and if <i>X</i> is used in
139 :     only one cellular machine, then it would be reasonable to infer that the machine was
140 :     active in the first state, but not the second.
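<p>
The inference just described can be written down directly. The following is a minimal sketch, assuming that machines are given as a mapping from a machine name to the set of proteins it uses; the function name and the example names are hypothetical.

```python
def machines_active_only_in_first(machines, first_list):
    """Infer which machines were active in the first state but not the second.

    machines:   dict mapping a machine name to the set of proteins it uses
    first_list: set of proteins whose genes were "active" only in the first state
    """
    # Index: for each protein, the set of machines that use it.
    users = {}
    for machine, proteins in machines.items():
        for protein in proteins:
            users.setdefault(protein, set()).add(machine)

    inferred = set()
    for protein in first_list:
        machines_using_protein = users.get(protein, set())
        # Only a protein used by exactly one machine supports the inference.
        if len(machines_using_protein) == 1:
            inferred |= machines_using_protein
    return inferred


machines = {"machine-1": {"X", "Y"}, "machine-2": {"Y", "Z"}}
print(machines_active_only_in_first(machines, {"X", "Y"}))  # {'machine-1'}
```

Protein <i>Y</i> contributes nothing here because it is shared by two machines, which is the caveat stated in the text.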
141 :    
142 :     <h2>The cell: the Enhanced Formulation Needed to Support Bioinformatics</h2>
143 :    
144 :     In the enhanced abstraction, we need to loosen up some concepts. In particular,
145 :     <ul>
146 :     <li> A <b>genome</b> is a set of strings in a 4-character alphabet. Each of the strings
147 :     is called a <b>contig</b>. Note that the concept as formulated covers both incomplete genomes and
148 :     genomes with multiple replicons.
149 :    
150 :     <li>The genes within a genome are of two distinct types:
151 :     <ol>
152 :     <li>those that describe how to construct a protein (i.e., protein-encoding genes), and
153 :     <li>those that describe how to construct a string of RNA (i.e., how to construct a string in the
154 :     4-character RNA alphabet {A,C,G,U}).
155 :     </ol>
156 :     <br><br>
157 :     <li>The location of a gene is generalized to be a set of regions within the genome (that are
158 :     concatenated to form the instructions needed to construct either a protein or a string of RNA).
159 :     <li>A protein is a string in an alphabet that now includes the 20 character codes from
160 :     the basic abstraction plus a very limited set of extra codes.
161 :     We already have cases in which <i>selenocysteine</i> and <i>pyrrolysine</i> appear as nonstandard
162 :     translations of codons, and there may eventually be more.
163 :    
164 :     <li>Each protein-encoding gene has both a DNA sequence (by definition) and a translation. However,
165 :     the translation is not required to exactly match what a codon-by-codon translation of the DNA sequence
166 :     would produce. This allows us to handle the very rare instances in which selenocysteine occurs as the translation
167 :     of TGA or pyrrolysine occurs as a translation of TAG (and others, if necessary).
168 :     </ul>
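<p>
The loosened concepts above might be represented as follows. This is a sketch under stated assumptions: the class and field names are hypothetical, strand orientation is ignored for brevity, and the one-letter code <i>U</i> is assumed as the extra code for selenocysteine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Region:
    contig: str  # ID of the contig the region lies on
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive


@dataclass
class Gene:
    gene_id: str
    kind: str                          # "protein" or "rna"
    regions: list[Region]              # concatenated in order to form the gene
    translation: Optional[str] = None  # stored separately from the DNA sequence


def dna_sequence(gene: Gene, genome: dict[str, str]) -> str:
    """Concatenate the gene's regions; a genome is a dict of contig ID -> string."""
    return "".join(genome[r.contig][r.start:r.end] for r in gene.regions)


# Hypothetical genome with a single contig and a gene located in two regions.
genome = {"contig1": "CCATGGGTGATAAGG"}
gene = Gene(
    gene_id="gene-1",
    kind="protein",
    regions=[Region("contig1", 2, 5), Region("contig1", 7, 13)],
    translation="MU",  # TGA is read as selenocysteine (assumed code "U"), not as a stop
)
print(dna_sequence(gene, genome))  # ATGTGATAA
```

Storing the translation as its own field, rather than deriving it from the DNA, is what lets the rare nonstandard translations coexist with the standard genetic code.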
169 :    
170 :     This loosened up formulation represents a very minimal set of changes. They should be left out of the
171 :     basic tutorial for computer scientists and mathematicians.
172 :    
173 :     <h2>The cell: Adding the Concepts Needed to Discuss Transcriptional Regulation</h2>
174 :    
175 :     In the final version of the abstraction, we add the minimal set of notions needed to support
176 :     analysis of transcriptional regulation. An <b>operon</b> is a set of contiguous genes that are all
177 :     on the same strand and are all co-regulated. We consider a gene that is not co-regulated with any adjacent genes
178 :     to be an operon composed of just itself. A <b>binding site</b> is a small region of DNA (normally
179 :     occurring a short space ahead of an operon) that acts as a switch turning the operon "on" or "off". When
180 :     a specific protein or expressed RNA called a <b>transcriptional regulator</b> binds the site, it flips the switch. One or more
181 :     specific transcriptional regulators can bind a specific site (i.e., sets of
182 :     sites are associated with each specific transcriptional regulator). A given regulator binding at a given site
183 :     always has the same effect (either activating or deactivating the operon), but which effect depends on
184 :     the site-regulator pair.
185 :    
186 :     <h1>Part 2: Tutorial Notes</h1>
187 :    
188 :     <h2>Notes for The Basic Abstraction</h2>
189 :    
190 : overbeek 1.3 We will be speaking about organisms that are a single cell. At some point life began on earth.
191 :     The single-celled organisms that we know of replicate producing copies of themselves that have
192 :     genomes which usually have very, very similar content to that of the parent cell. <b>Evolution</b> is the
193 :     process in which cells replicate with some alterations in their genomes, are subjected to
194 :     <i>selective pressure</i>, and survive or not depending on many somewhat random factors. The makeup of
195 :     cells (i.e., the genomes they contain and the machines that define what they are capable of doing)
196 :     changes gradually (and sometimes not so gradually) as time passes.
197 :     <p>
198 :     The original life forms that existed billions of years ago have evolved into three broad categories of
199 :     life forms. That is, the evolutionary process led to early divisions, and these led to three main
200 :     categories of single-celled organisms. We call these three forms the <b>archaea</b>,
201 :     the <b>bacteria</b>, and the <b>eukaryotes</b>.
202 :     A majority of the organisms for which we have acquired complete genomes are from the bacteria,
203 :     although the
204 :     numbers are rapidly growing for all three domains.
205 :     <p>
206 : overbeek 1.4 The minimal notion of a cell is enough to explain some of the basic
207 : overbeek 1.1 problems in bioinformatics:
208 :    
209 :     <h3>Identify the genes within a genome</h3>
210 :    
211 : overbeek 1.3 If we are to understand the contents of genomes, we will need to
212 :     locate the genes that occur in each genome. This problem simply involves taking a genome (a
213 :     string of DNA) and locating the set of genes it contains.
214 :     In the case of bacteria and archaea, we know pretty well how to
215 :     locate the genes.
216 :     Once we
217 :     have identified instances from many genomes, it becomes possible to
218 :     recognize the genes in a new genome by just looking for things similar
219 :     to those we already understand. The following problem is at the heart of recognizing when two
220 :     genes are "similar".
221 :    
222 :     <h3>Given two genes, "align" them in a way that minimizes some edit function.</h3>
223 : overbeek 1.1
224 : overbeek 1.3 For example, here is what you see when you align two genes from distinct organisms:
225 : overbeek 1.1
226 :     <pre>
227 :    
228 :    
229 : overbeek 1.3 gene1 ATGGCTGATTTATTCGCATTGACCGAAGAAGCGTTGGCGGGCATGGGCATCGAGTTGGTC
230 :     gene2 ---GTGCAACTGACGGAACTGATAGAAACTACGGTCACGGGGCTCGGCTACGAGCTCGTC
231 :     * * * * * *** *** ** * **** * *** **** * ***
232 :    
233 :     gene1 GATGTCGAACGTGCCGCCTTAGGCTTGTTGCGCGTGACCATAGACCGTGAGGACGGTGTT
234 :     gene2 GATCTCGAGCGCACCGGGCGCGGCATGGTCTGCGTCTACATCGATCAGCCCGCCGGCATC
235 :     *** **** ** *** *** ** * **** *** ** * * *** *
236 :    
237 :     gene1 CGCATCGAAGATTGTGAGCAGGTGTCCCGGCAATTGTCGCGCGTCTACGAGGTCGAGAAC
238 :     gene2 ACGATCGACGATTGCGAGAAGGTCACGCGTCAGCTCCAGCACGTACTGACGGTCGAAAAC
239 :     ***** ***** *** **** * ** ** * ** *** ****** ***
240 :    
241 :     gene1 ATCGATTACAAACGTCTGGAAGTTGGCTCGCCGGGCGTGGATCGCCCCTTGCGCAACGAG
242 :     gene2 ATCGATTACGAACGGCTCGAGGTCTCGTCACCGGGGCTCGACCGGCCGTTGAAGAAGCTG
243 :     ********* **** ** ** ** ** ***** * ** ** ** *** ** *
244 :    
245 :     gene1 GCGGAATTCCGTCGTTTCGCGGGTGAACGTATCGAGATCAAGCTGCGTGAGGCAGTCGAC
246 :     gene2 GCTGACTTCACGCGTTTCGCGGGCAGCGAGGCCGTCATCACCCTGAAAAAGCCGTTGGAC
247 :     ** ** *** *********** ** **** *** ** * * ***
248 :    
249 :     gene1 GGGCGCAAAGTGTTTACCGGCATCCTGCAAGAGGCGGACACGTCTGCTGACGATAAGACG
250 :     gene2 GGGCGCAAGACGTACCGGGGCATTCTGCACGCGCCGAAC------------GGCGAGACG
251 :     ******** ** ***** ***** * * ** ** * *****
252 :    
253 :     gene1 GTGTTCGGTCTCGAATTTGAGGCAAAGAAGGACGATATTCAGGTACTGAGCTTCACGCTC
254 :     gene2 AT---CGGTTTGGAATTTGAGAGGAAGAAGGGCGAGGCGGCCATGCTGGATTTCACGCTG
255 :     * **** * ********* ******* *** * *** ********
256 :    
257 :     gene1 GATGACATCGAGCGCGCCAAGCTGGATCCCGTTCTGGATTTCAAGGGCAAAAAGCGATGA
258 :     gene2 GCGGACATCGACAAGGCCCGCCTGATTCCGCACGTTGACTTTAGGAGCCGCAAACAATGA
259 :     * ******** *** *** *** * ** ** * * ** ** * ****
260 :     </pre>
261 :     <hr>
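<p>
To make "minimizes some edit function" concrete, here is a minimal dynamic-programming sketch of global alignment. The cost model (0 for a match, 1 for a mismatch, 1 for each gap character) is an assumption chosen for simplicity; the alignment shown above was certainly produced with a more refined scoring scheme.

```python
def align(a: str, b: str, mismatch: int = 1, gap: int = 1):
    """Globally align a and b; return (total cost, aligned a, aligned b)."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cost of aligning a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = cost[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            cost[i][j] = min(substitute, cost[i - 1][j] + gap, cost[i][j - 1] + gap)

    # Trace back from the corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (
            0 if a[i - 1] == b[j - 1] else mismatch
        ):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return cost[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


total, row_a, row_b = align("ACGT", "AGT")
print(total)  # 1
print(row_a)
print(row_b)
```

The table has (n+1) by (m+1) entries, so time and memory grow with the product of the two sequence lengths.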
262 :    
263 :     The sequences are recognizably similar, and in fact implement exactly the same function
264 :     in the two cells. If we align the protein sequences corresponding to these two
265 :     genes, we get
266 :    
267 :     <pre>
268 :     gene1 MADLFALTEEALAGMGIELVDVERAALGLLRVTIDREDGVRIEDCEQVSRQLSRVYEVEN
269 :     gene2 -VQLTELIETTVTGLGYELVDLERTGRGMVCVYIDQPAGITIDDCEKVTRQLQHVLTVEN
270 :     :* * * :::*:* ****:**:. *:: * **: *: *:***:*:***.:* ***
271 :    
272 :     gene1 IDYKRLEVGSPGVDRPLRNEAEFRRFAGERIEIKLREAVDGRKVFTGILQEADTSADDKT
273 :     gene2 IDYERLEVSSPGLDRPLKKLADFTRFAGSEAVITLKKPLDGRKTYRGILHAPNG-----E
274 :     ***:****.***:****:: *:* ****.. *.*::.:****.: ***: .:
275 :    
276 :     gene1 VFGLEFEAKKDDIQVLSFTLDDIERAKLDPVLDFKGKKR
277 :     gene2 TIGLEFERKKGEAAMLDFTLADIDKARLIPHVDFRSRKQ
278 :     .:***** **.: :*.*** **::*:* * :**:.:*:
279 : overbeek 1.1 </pre>
280 :    
281 : overbeek 1.3 There is a great deal of work relating to recognizing when two sequences are
282 :     similar and whether or not they had a common ancestor. Understanding why
283 :     selective pressure conserves sections of sequences, but not others, will yield
284 :     important clues. Can you reason out why some sections might be conserved, while
285 :     others vary wildly?
286 :     <p>
287 :    
288 :     Comparing sets of sequences that have retained the same function is
289 :     at the heart of understanding cellular machines and the proteins that implement them.
290 :     We find that looking at sets (often with more than two sequences) and aligning them
291 :     is important.
292 :    
293 : overbeek 1.1
294 :     <h3> Given a set of sequences, align them in a way that minimizes some edit function.</h3>
295 :    
296 :     Here is an example of a multiple sequence alignment:
297 :     <br>
298 :     <br>
299 :     <pre>
300 :     CLUSTAL W (1.83) multiple sequence alignment
301 :    
302 :    
303 :     seq3 -------------------MRYISTRGQAPALNFEDVLLAGLASDGGLYVPENLPRFTLE
304 :     seq4 -------------------MRYISTRGSAPTLSFEEVLLTGLASDGGLYVPESLPSFTSA
305 :     seq5 -------------------MNYISTRGAIAPIGFKDAVMMGLATDGGLLLPETIPALGRN
306 :     seq1 -------------------MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLT
307 :     seq2 MKIRVICGAPTPKPFIKIPMKYYSTNKQAPLASLEEAVVKGLASDKGLFMPMTIKPLPQE
308 :     *. . . .: :.: **..: ** .* : :
309 :    
310 :     seq3 EIASWVGLPYHELAFRVMRPFVAGSIADADFKKILEETYGVFAHDAVAPLRQLNGNEWVL
311 :     seq4 ELEAMASLDYPSLAHRILLPFVEEAFTGEELREIIDDTYAVFRHSAVAPLVQLDHNQWVL
312 :     seq5 TLESWQSLSYQDLAFNVIS-LFADDIPAQDLKDLIDRSYATFSHPEITPVVEKDG-VYIL
313 :     seq1 EIDEMLKLDFVTRSAKILSAFIGDEIPQEILEERVRAAFAFP-----APVANVESDVGCL
314 :     seq2 FYDEIENLSFREIAYRVADAFFGEDVPAETLKEIVYDTLNFD-----VPLVPVKENIYSL
315 :     * : : .: :. .. :.. : : .*: . *
316 :    
317 :     seq3 ELFHGPTLAFKDFALQLLGRLLDHVLAKRGER-VVIMGATSGDTGSAAIEGCRRCDNVDI
318 :     seq4 ELFHGPTLAFKDFALQLLGRLLDAILKRRGEK-VVIMGATSGDTGSAAIAGCERCENIDI
319 :     seq5 ELFHGPTLAFKDVALQLLGNLFEYLLKERGEK-MNIVGATSGDTGSAAIYGVRGKDKINI
320 :     seq1 ELFHGPTLAFKDFGGRFMAQMLTHIA---GDKPVTILTATSGDTGAAVAHAFYGLPNVKV
321 :     seq2 ELFHGPTLAFKDVGGRFMARLLGYFIRKEGRKQVNVLVATSGDTGSAVANGFLGVEGIHV
322 :     ************.. :::..:: . * : : :: *******:*. . :.:
323 :    
324 :     seq3 FIMHPHNRVSEVQRRQMTTILGDNIHNIAIEGNFDDCQEMVKASFADQGFLK-GTRLVAV
325 :     seq4 FILHPHGRVSEVQRRQMTTLSAPTIHNLAIEGNFDDCQAMVKASFRDQSFLPDGRRLVAV
326 :     seq5 FILHPHGKTSPVQALQMTTVLDPNVHNIAARGTFDDCQNIVKSLFSDLPFKE-KYSLGAV
327 :     seq1 VILYPRGKISPLQEKLFCTLGG-NIETVAIDGDFDACQALVKQAFDDEELKV-ALGLNSA
328 :     seq2 YVLYPKGKVSEIQEKQFTTLGR-NITALEVDGTFDDCQALVKAAFMDQELNE-QLLLTSA
329 :     :::*:.: * :* : *: .: : * ** ** :** * * : * :.
330 :    
331 :     seq3 NSINWARIMAQIVYYFHAALQLG-APH-RSVAFSVPTGNFGDIFAGYLARNMGLPVSQLI
332 :     seq4 NSINWARIMAQIVYYFYAGLRLG-APH-RAAAYSVPTGNFGDIFAGYLASKMGLPVAQLM
333 :     seq5 NSINWARVLAQVVYYFYAYFRVA-ALFGQEVVFSVPTGNFGDIFAGYVAKRMGLPIRRLI
334 :     seq1 NSINISRLLAQICYYFEAVAQLPQETRNQ-LVVSVPSGNFGDLTAGLLAKSLGLPVKRFI
335 :     seq2 NSINVARFLPQAFYYFYAYAQLKKAGRAENVVICVPSGNFGNITAGLFGKKMGLPVRRFI
336 :     **** :*.:.* *** * :: . . .**:****:: ** .. :***: :::
337 :    
338 :     seq3 VATNRNDILHRFMSGNRYDKDTLHPSLSPSMDIMVSSNFERLLFDLHGRNGKAVAELLDA
339 :     seq4 IATNRNDVLHRLLSTGDYARQTLEHTLSPSMDISVSSNFERLMFDLYERDGAAIASLMAA
340 :     seq5 LATNENNILSRFINGGDYSLGDVVATVSPSMDIQLASNFERYVYYLFGENPARVREAFAA
341 :     seq1 AATNVNDTVPRFLHDGQWSPKATQATLSNAMDVSQPNNWPR-VEELFR------------
342 :     seq2 AANNKNDIFYQYLQTGQYNPRPSVATIANAMDVGDPSNFAR-VLDLYGGS----------
343 :     *.* *: . : : . : ::: :**: ..*: * : *.
344 :    
345 :     seq3 FKASGKLSVEDQRWTEARKLFDSLAVSDEQTCETIAEVYRSCGELLDPHTAIGVRAAREC
346 :     seq4 FDD-GDITLSDAAMEKARQLFASHRVDDAQTLACIADVWGRTEYLLDPHSAIGYAAATQP
347 :     seq5 LPTKGRIDFTEAEMEKVRDEFLSRSVNEDETIATIAAFHRETGYILDPHTAVGVKAALEL
348 :     seq1 -------------RKIWQLKELGYAAVDDETTQQTMRELKELGYTSEPHAAVAYRALRDQ
349 :     seq2 -------------HAAIAAEISGTTYTDEQIRESVKACWQQTGYLLDPHGACGYRALEEG
350 :     . : : :** * . * :
351 :    
352 :     seq3 RRSLSVPMVTLGTAHPVKFPEAVEKAGIGQAPALPAHLADLFEREERCTVLPNELAKVQA
353 :     seq4 GANTQTPWVTLATAHPAKFPDAIKASAVGTTAQLPVHLADLFERSEHFDVLPNDIAAVQR
354 :     seq5 VQDG-TPAVCLATAHPAKFAEAVVR-AVGFEPSRPTSLEGIEALPSRCDVLDADRDAIKA
355 :     seq1 LNPG-EYGLFLGTAHPAKFKESVEA-ILGETLDLPKELAERADLPLLSHNLPADFAALRK
356 :     seq2 LQPG-ETGVFLETAHPAKFLQTVES-IIGTEVEIPAKLRAFMKGEKKSLPMTKEFADFKS
357 :     : * ****.** ::: :* * * : : .:
358 :    
359 :     seq3 FVSQHGNRGKPL
360 :     seq4 FMSGHLGA----
361 :     seq5 FIEKKAL-----
362 :     seq1 LMMNHQ------
363 :     seq2 YLLGK-------
364 :     : :
365 :     </pre>
366 :    
367 :     <h3> Given a multiple sequence alignment, determine the most likely evolutionary history of the sequences (i.e., construct a phylogenetic tree).</h3>
368 :    
369 : overbeek 1.3 From the extant five sequences that are similar and displayed in the previous alignment, we can construct
370 :     a tree that depicts the "phylogenetic history" of the sequences.
371 :     Here is one reasonable tree for the last 5 sequences.
372 :    
373 : overbeek 1.1 <pre>
374 : overbeek 1.3 ,--------------------------------------------------- seq1
375 :     |
376 :     |
377 :     ,------------------|
378 :     | |
379 :     | |
380 :     | `---------------------------------------------- seq2
381 :     |
382 :     |
383 :     |
384 :     ,----|
385 :     | |
386 :     | | ,-------------------------------- seq3
387 :     | | |
388 :     | | |
389 :     | |-------------|
390 : overbeek 1.1 | |
391 :     | |
392 : overbeek 1.3 | `------------------------------ seq4
393 : overbeek 1.1 |
394 :     |
395 :     `---------------------------------------------- seq5
396 :     </pre>
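<p>
Many tree-building methods start from pairwise distances computed on the alignment. The sketch below computes one simple distance (the fraction of compared columns that differ) and finds the closest pair of sequences. It uses only the final 12-column block of the alignment shown earlier, so it is an illustration and not a reconstruction of how the tree above was built; the function names are assumptions of this sketch.

```python
from itertools import combinations


def p_distance(row_a: str, row_b: str) -> float:
    """Fraction of compared columns that differ; columns with a gap are skipped."""
    compared = differing = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            continue
        compared += 1
        differing += x != y
    return differing / compared if compared else 1.0


def closest_pair(alignment: dict[str, str]) -> tuple[str, str]:
    """Return the pair of sequence names with the smallest p-distance."""
    return min(
        combinations(sorted(alignment), 2),
        key=lambda pair: p_distance(alignment[pair[0]], alignment[pair[1]]),
    )


# Final block of the multiple sequence alignment shown earlier.
last_block = {
    "seq3": "FVSQHGNRGKPL",
    "seq4": "FMSGHLGA----",
    "seq5": "FIEKKAL-----",
    "seq1": "LMMNHQ------",
    "seq2": "YLLGK-------",
}
print(closest_pair(last_block))  # ('seq3', 'seq4')
```

Even on this tiny slice, seq3 and seq4 come out as the closest pair, which agrees with their position as neighbors in the tree.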
397 :    
398 : overbeek 1.3 The tree suggests that at some point an ancestral
399 :     cell replicated. One copy led (through a chain of descendants) to <b>seq5</b>, while the remaining sequences descend
400 :     from the other copy.
401 :     <p>
402 :     Note that we now have alignments that
403 :     contain thousands of sequences, and even displaying such trees is nontrivial.
404 :     Because evolution plays such a central role in the phenomena we study, the construction of alignments
405 :     and trees in order to compare extant versions of proteins and gain insight into their historical origins
406 :     is considered basic to the task at hand.
407 : overbeek 1.1
408 : overbeek 1.4 <h3>Some Random Facts that You Should Absorb</h3>
409 : overbeek 1.1
410 :     Most genomes of bacteria contain between 400,000 and 12,000,000 characters.
411 :     Normally, the genes in a genome
412 :     cover about 90% of the genome.
413 :     Normally, there is about one gene per 1000 characters in a bacterial genome.
414 :     <p>
415 :     So,
416 :     <ul>
417 :     <li> What is the length of the average protein sequence?
418 :     <li>How many genes do these
419 :     genomes have?
420 :     <li>What is the average length of a gene?
421 :     </ul>
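<p>
One way to work through these questions is to plug the rough figures quoted above into a few lines of arithmetic. The sketch assumes the figures are taken at face value (one gene per 1000 characters, 90% coverage) and that each protein-encoding gene ends in one stop codon that is not translated.

```python
CHARS_PER_GENE = 1000  # about one gene per 1000 characters of genome
CODING_PERCENT = 90    # genes cover about 90% of the genome


def estimates(genome_length: int):
    """Return (gene count, average gene length, average protein length)."""
    gene_count = genome_length // CHARS_PER_GENE
    # Each 1000-character stretch holds one gene covering about 90% of it.
    average_gene_length = CHARS_PER_GENE * CODING_PERCENT // 100
    # Three characters per codon; the final stop codon adds no amino acid.
    average_protein_length = average_gene_length // 3 - 1
    return gene_count, average_gene_length, average_protein_length


print(estimates(400_000))     # (400, 900, 299)
print(estimates(12_000_000))  # (12000, 900, 299)
```

So these genomes hold roughly 400 to 12,000 genes, an average gene is about 900 characters long, and an average protein is about 300 amino acids long.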
422 :     <br>
423 : overbeek 1.3 It is worth spending just a short bit of time thinking about what types of
424 :     machines must exist in each cell. Here are a few thoughts to start with
425 : overbeek 1.1 <ul>
426 :     <li>
427 :     There must be one or more machines that support replication of the cell. You would
428 :     need something to copy the genome, and you would need something that could build the DNA
429 :     bases that represent the characters (i.e., you will need machines to build the molecules
430 :     corresponding to each of the four characters in the alphabet of DNA bases).
431 :     <li>
432 :     As we mentioned, you have transport machines that take things into and out of the cell. Many
433 :     cells can import food in the form of sugar molecules. For example, many cells can import
434 :     <i>glucose</i>, a six-carbon compound. As the compound gets broken down into smaller compounds,
435 :     energy is salvaged from the broken bonds to power the machines in the cell. The smaller compounds
436 :     are used as building blocks for other needs.
437 :     <li>
438 :     There must be one or more machines involved in building proteins from the descriptions in the genes.
439 :     In particular, we will need a machine to build each of the amino acids (unless the cell can import some
440 :     of them).
441 :     <li>
442 :     There must be mechanisms for sensing what is going on in the environment and allowing the cell
443 :     to react to it. For example, many cells can "swim" towards food.
444 :     </ul>
445 :     Those were just a few examples. For any cell, we have many, many machines, and we still
446 : overbeek 1.3 do not even understand what some of them do. Later, we will try to offer a more structured
447 :     estimate of what is already known.
448 : overbeek 1.1 <p>
449 :     About 50-60% of the genes occur within 5000 characters of another gene such that
450 : overbeek 1.3 the two genes encode proteins that are part of the same cellular machine. This fact
451 :     suggests that just having a large number of genomes would enable a person to group
452 :     the genes into the machines they implement, without the person understanding the functions
453 :     of the machines or the roles played by each protein.
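<p>
The idea of grouping genes by proximity across many genomes can be sketched as follows. The sketch assumes that each gene has already been assigned to a protein family, that a genome is a list of (family, start position) pairs on a single contig, and that "close" means start positions within 5000 characters; the function names and example data are hypothetical.

```python
from collections import Counter


def neighbor_family_pairs(genome, max_distance=5000):
    """Return the set of family pairs whose genes start within max_distance."""
    ordered = sorted(genome, key=lambda gene: gene[1])
    pairs = set()
    for i, (family_a, start_a) in enumerate(ordered):
        for family_b, start_b in ordered[i + 1:]:
            if start_b - start_a > max_distance:
                break  # genes are sorted, so all later genes are farther away
            if family_a != family_b:
                pairs.add(tuple(sorted((family_a, family_b))))
    return pairs


def conserved_pairs(genomes, min_genomes=2):
    """Family pairs that are neighbors in at least min_genomes genomes."""
    counts = Counter()
    for genome in genomes:
        counts.update(neighbor_family_pairs(genome))
    return {pair for pair, n in counts.items() if n >= min_genomes}


genome_1 = [("famA", 100), ("famB", 1200), ("famC", 90000)]
genome_2 = [("famB", 5000), ("famA", 6500), ("famD", 7000)]
print(conserved_pairs([genome_1, genome_2]))  # {('famA', 'famB')}
```

A pair that stays close in many genomes is a candidate for membership in the same machine, even when nothing is known about what either protein does.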
454 : overbeek 1.1 <p>
455 :     Occasionally, proteins that are usually distinct in most cells are fused into a single protein in
456 :     a few cells. In these cases, the fused gene is (by definition) part of a single machine, and
457 :     in most cells in which the proteins are not fused, the two distinct proteins are separate components
458 : overbeek 1.3 of a single machine. This, too, offers clues to support analysis of which proteins go with which machines.
459 : overbeek 1.1 <p>
460 :     Biologists have figured out the roles of about 50% of the genes. That is, they can
461 :     place the gene in a cellular machine, they know what the machine does, and they know
462 :     the specific role of the gene in sustaining the functionality of the machine.
463 :     <br><br>
464 :    
465 : overbeek 1.4 <h3>Imposing a Structure on Characterizing the Inventory</h3>
466 : overbeek 1.1
467 :     One central goal of bioinformatics is to support an accurate characterization of the cellular
468 :     machinery for each cell. It is of major importance to biologists that we be able to support
469 :     comparative analysis of cells. Perhaps the most important aspect of understanding cells relates to
470 :     their origin in an evolutionary process. Cells have a long evolutionary history dating back billions of
471 :     years. The machines we see in cells today arose in the past, so we expect to see many current cells
472 :     using machinery that resembles what turns up in other cells. When we compare machines from different
473 :     cells they often look remarkably similar. On the other hand, those that had a common origin in a cell that existed
474 :     billions of years in the past may now have versions that are not very similar. Modifications, optimizations,
475 :     and insignificant alterations all combine to explore the space of operational possibilities for
476 :     each type of machine. Hence, we need a framework for studying similarities and differences in the
477 :     cellular machines and the proteins that implement them.
478 :     <p>
479 :    
480 :     Here is a short formulation of one way to do this:
481 :     <br><br>
482 :     <ul>
483 :     <li>A <b>subsystem</b> (i.e., an abstract cellular machine) is a set of functional roles.
484 :     <li>Each protein implements one or more functional roles. The set of functional roles
485 :     implemented by the protein is called the <b>function of the protein</b>. The function of a multifunctional
486 :     protein that implements {functional-role-1,functional-role-2} is normally written as
487 :     <i>functional-role-1 / functional-role-2</i>.
488 :     <br><br>
489 :     <li>A <b>populated subsystem</b> is a subsystem with an attached spreadsheet. Each column
490 :     in the spreadsheet corresponds to a functional role in the subsystem, and each row corresponds to
491 :     a specific genome. Each cell in the spreadsheet contains the genes from the corresponding genome
492 :     that implement the designated functional role (there may be 0 or more such genes).
493 :     </ul>
494 :     <br><br>
495 :     We do not actually know what machines are present in a cell. We are in the midst of a grand
496 :     effort to clarify which are there and what they do. The formulation of subsystems as abstract machines
497 :     in which each row of the subsystem describes a specific cellular machine that is believed to be present,
498 :     represents a way to maintain a collection of estimates or assertions.
499 :     <p>
500 :     A <b>protein family</b> is defined to be a set of proteins that implement the same functional roles and
501 :     are similar over the entire lengths of the proteins.
502 :     <p>
503 :     We seek a situation in which each protein occurs in one or more subsystems and in a single protein family.
504 :     The computational tasks imposed by such a goal are obvious:
505 :     <ul>
506 :     <li>We need to construct databases that implement at least the following entities:
507 :     <ol>
508 :     <li>cells (i.e., each cell must have an ID and a set of attributes),
509 :     <li>genomes,
510 :     <li>genes,
511 :     <li>proteins,
512 :     <li>functional roles,
513 :     <li>subsystems, and
514 :     <li>protein families.
515 :     </ol>
516 :     <li> We need to add support for developing clues to function by integrating data
517 :     from sources like proximity within the genome, fusions, etc.
518 :     <li>We need to support a framework for the development of populated subsystems.
519 :     <li>We need to construct decision procedures for membership in protein families. Some
520 :     of these procedures will be quite complex, although the majority of cases can be
521 :     handled by fairly general procedures.
522 :     </ul>
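<p>
The goal stated above (each protein occurs in one or more subsystems and in a single protein family) can be expressed as a simple consistency check over a few of the listed entities. The sketch below is a minimal illustration under stated assumptions: the class and field names are hypothetical, and it assumes that a protein "occurs in" a subsystem when it implements at least one of that subsystem's functional roles.

```python
from dataclasses import dataclass, field


@dataclass
class Protein:
    protein_id: str
    roles: frozenset  # the functional roles this protein implements


@dataclass
class Subsystem:
    name: str
    roles: frozenset  # the functional roles that make up the subsystem


@dataclass
class ProteinFamily:
    family_id: str
    members: set = field(default_factory=set)  # protein IDs


def check_protein(protein, subsystems, families):
    """Return a list of ways in which a protein falls short of the stated goal."""
    problems = []
    # Assumption: sharing at least one functional role places the protein in the subsystem.
    in_subsystems = [s.name for s in subsystems if protein.roles & s.roles]
    in_families = [f.family_id for f in families if protein.protein_id in f.members]
    if not in_subsystems:
        problems.append("not in any subsystem")
    if len(in_families) != 1:
        problems.append(f"in {len(in_families)} protein families, expected 1")
    return problems
```

Running such a check over every protein gives a rough measure of how far a collection is from the situation we seek.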
523 :    
524 : overbeek 1.4 <h3>States of the Cell</h3>
525 : overbeek 1.1
526 : overbeek 1.2 The notion of <i>subsystem</i> was introduced as an <i>abstract machine</i> -- that is, as an
527 :     attempt to create a framework for understanding variations within specific cellular machines via
528 :     a form of comparative analysis.
529 :    
530 :     In any specific cell, sets of specific cellular machines are
531 :     switched on and off as units. That is, they are <i>co-regulated</i>. We will call such a set
532 :     of <i>co-regulated cellular machines</i> a <b>regulon</b> (note that a regulon is often a set containing
533 :     a single cellular machine). A <b>state</b> of a cell will be defined
534 :     as the set of regulons that are operational at a point in time. Thus, a state amounts to the set
535 :     of cellular machines that are operational at one instant.
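<p>
These definitions map directly onto sets. In the sketch below, which is an illustration only, a cellular machine is represented by its name, a regulon is a frozenset of machine names, and a state is a set of regulons; the function name <i>operational_machines</i> is hypothetical.

```python
def operational_machines(state):
    """Return the set of cellular machines that are operational in a state.

    A state is a set of regulons, and each regulon is a frozenset of
    co-regulated cellular machines, so the operational machines are
    simply the union of the regulons in the state.
    """
    machines = set()
    for regulon in state:
        machines |= regulon
    return machines


# Hypothetical example: one regulon with two machines, one with a single machine.
regulon_1 = frozenset({"machine-A", "machine-B"})
regulon_2 = frozenset({"machine-C"})
state = {regulon_1, regulon_2}
```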
536 : overbeek 1.1 <p>
537 : overbeek 1.2 If we think of a car as a bag of machines that interact to make it function, we might consider there
538 :     to be a huge number of states. There are many very minor "machines" like the arm rest (or the radio, or the night light) that can be on or off. However, we can divide the states of a car into major groupings based on the status
539 :     of some key "machines". For example, "off" (the state in which the engine is turned off and the car is parked) and
540 :     "on" (the engine is running and the car is moving) might be viewed as a crude partitioning of the states into
541 :     two "major states".
542 : overbeek 1.1 <p>
543 : overbeek 1.2 Similarly, I believe that we should think about <i>major states of the cell</i> as being determined by the
544 :     functioning (or not) of a limited set of regulons. The determination of these regulons, the major states,
545 :     and how transitions between them are managed are all now parts of the picture being filled in.
546 :    
547 :    
548 : overbeek 1.4 <h3>Microarrays</h3>
549 : overbeek 1.2
550 :     Microarrays are, for a given genome, two lists of genes that "changed expression levels" between two states of a
551 :     cell. Basically, the first list contains genes that were "active" during the first state, but not the second; and the
552 :     second list contains genes that were "active" in the second but not the first. If a cellular
553 :     machine utilizes protein <i>X</i>, and <i>X</i> is in the first list, and if <i>X</i> is used in
554 :     only one cellular machine, then it would be reasonable to infer that the machine was
555 :     active in the first state, but not the second. If one knew the regulons for a specific cell, it would go
556 :     a long way toward supporting the extraction of insights from these microarrays. On the other hand, if one had many,
557 :     many microarrays, and if the specific cellular machines for the cell are known, then one could make
558 :     substantial progress in uncovering the exact composition of the regulons that make up the cell.
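<p>
The inference described above can be sketched in a few lines. This is a minimal illustration, assuming that the two lists are given as collections of protein identifiers and that the composition of each cellular machine is known; the function and parameter names are hypothetical.

```python
def infer_machine_activity(first_list, second_list, machine_proteins):
    """Infer which machines were active in only one of two states.

    first_list / second_list: proteins "active" only in the first / second state.
    machine_proteins: machine name -> set of proteins the machine utilizes.
    A protein supports an inference only if it is used in exactly one machine.
    """
    # Invert the mapping: protein -> set of machines that utilize it.
    usage = {}
    for machine, proteins in machine_proteins.items():
        for protein in proteins:
            usage.setdefault(protein, set()).add(machine)

    active_first, active_second = set(), set()
    for proteins, target in ((first_list, active_first), (second_list, active_second)):
        for protein in proteins:
            machines = usage.get(protein, set())
            if len(machines) == 1:  # unambiguous: the protein belongs to one machine
                target |= machines
    return active_first, active_second
```

Proteins that are shared between machines are deliberately ignored here, since the microarray alone does not say which of the machines was switched.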
559 : overbeek 1.1
560 : overbeek 1.4 <h2>Notes for the Enhanced Abstraction</h2>
561 :    
562 :     The process of <b>expressing a gene</b> amounts to using the gene to produce the functional component of
563 :     a machine (a protein for a protein-encoding gene, and an RNA for an RNA-encoding gene).
564 :     The process of expressing a protein-encoding gene, which takes a gene (a string of DNA formed by concatenating a sequence of
565 :     regions from contigs) and produces a protein, is normally thought of as taking place in two steps.
566 :     <b>Transcription</b> is the process of a specific machine moving along the contig and making a copy of the
567 :     gene as RNA. This string of RNA is then <b>translated</b> by a separate machine. The machine that performs
568 :     the copying of the gene into a string of RNA is called an <b>RNA polymerase</b>. The machine to translate
569 :     the RNA into a protein, the <b>ribosome</b>, is made up of both proteins and RNA components.
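<p>
The two steps can be illustrated with a short sketch. It rests on several simplifying assumptions that are not part of the abstraction itself: the gene is given as a single coding-strand DNA string read from its start, the standard genetic code applies, and no "nonstandard" characters occur. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order
# ('*' marks a stop codon).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(gene_dna):
    """Step 1 (the role of the RNA polymerase): copy the gene as a string of RNA."""
    return gene_dna.upper().replace("T", "U")


def translate(rna):
    """Step 2 (the role of the ribosome): read codons until a stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For example, <i>translate(transcribe("ATGGCCTAA"))</i> yields the two-residue protein "MA": a methionine, an alanine, and then a stop codon.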
570 :     <p>
571 :     The need to allow occasional "nonstandard" characters in protein sequences and a loosening of the correspondence
572 :     just proteins. Some of the most fundamental questions in biology relate to how life started and the steps
573 :     required to gradually enrich the basic machinery to the point where this magnificent information storage and
574 :     maintenance system based on DNA, RNA and proteins could have arisen. There is much that can be inferred by
575 :     reasoning back from what we now observe and reasoning forward from the relatively little we know of
576 :     what the early earth was like. One possible set of goals would be to first understand in detail the inventory
577 :     of components we now see in life forms, composing something analogous to a CAD/CAM system describing life forms.
578 :     Then, as a second step, to understand the sequence of transformations that led from some initial raw components
579 :     to initial life forms to those we have seen and characterized.
580 :     <p>
581 :     The need to allow occasional "nonstandard" characters in protein sequences and a loosening of the corespondence
582 :     between a gene and characters in the protein sequence it can be used to build results from the fact that
583 :     evolution has produced the existing genetic codes and they continue to evolve (either converging or diverging
584 :     depending on the outcome of basically random processes operating under selective pressure).
585 :     <br>
586 :    
587 :     <h2>Notes on the Abstraction Extended to Support Regulation</h2>
588 :    
589 :     There are two basically different regulatory mechanisms in the cell. In one, you have a metabolic
590 :     network in which fluxes are tightly controlled by positive and negative feedback loops. This <b>metabolic
591 :     regulation</b> occurs very rapidly. <b>Transcriptional regulation</b> occurs orders of magnitude more
592 :     slowly. It is just this transcriptional regulation that we consider in this extension.
593 :     <p>
594 :    
595 :     As the cell changes state, regulons are activated or de-activated by
596 :     transcriptional regulators (either protein or RNA) binding to specific
597 :     sites in the DNA. This model has the redeeming characteristic of
598 :     simplicity. It is certainly the case that there are innumerable
599 :     important issues that it disregards (e.g., regulation based on DNA
600 :     packaging, regulation by small RNAs binding the RNAs produced by
601 :     transcription, etc.). In forming any clear notion of transcriptional
602 :     regulation and how it is achieved, we will need to carefully separate
603 :     these different mechanisms, since they have fundamentally different
604 :     modes of control and operation. We are arguing that the notion of a
605 :     protein or RNA being used to flip regulons on and off by binding to
606 :     control sites within the genome is a major form of regulation and
607 :     probably the right place to start any effort to formulate a useful
608 :     abstraction.
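<p>
Under this simple model, a change of state can be sketched as regulators flipping regulons on or off. The code below is an illustration only: regulons are represented by name, a state is a frozenset of the regulon names that are currently operational, and the function name and the shape of the <i>effects</i> argument are hypothetical.

```python
def apply_regulators(state, effects):
    """Return the state that results when regulators bind their control sites.

    state:   frozenset of the names of the regulons that are operational.
    effects: regulon name -> True (activated) or False (de-activated),
             summarizing what the bound regulators do (hypothetical format).
    """
    new_state = set(state)
    for regulon, activated in effects.items():
        if activated:
            new_state.add(regulon)
        else:
            new_state.discard(regulon)
    return frozenset(new_state)
```

Mechanisms that the model disregards, such as regulation based on DNA packaging, have no counterpart in this sketch.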
609 :    
610 :     <h1>The Role of Bioinformatics in Supporting the Genomic Revolution</h1>
611 :    
612 :     Within the growing genomics revolution, one can easily divide developments and
613 :     goals into those relating to advances in medicine and agriculture and those relating to
614 :     pure science. Here we consider only issues relating to pushing advances in basic research.
615 :     Here is an overview of our perspective:
616 :     <ol>
617 :     <li> The different life forms that now exist were produced by an evolutionary process,
618 :     which leads to our view that comparative analysis is the key to understanding. Biological
619 :     machines that exist in complex forms will often also still exist in simpler forms (usually
620 :     in simpler organisms).
621 :     <li> Unravelling exactly how a machine works is more easily done in simpler organisms. They
622 :     are easier to work with, and it is easier to gather the data needed to support comparative analysis.
623 :    
624 :     <li> This leads to the view that we should try to understand single-celled organisms to lay
625 :     the foundation for analysis of multicellular organisms.
626 :    
627 :     <li> The characterization of unicellular life will require access to orders of magnitude
628 :     more data than exist now (we have more-or-less complete sequences for about 1000 genomes, but
629 :     that represents a small fraction of a percent of extant single-celled life forms).
630 :    
631 :     <li> The immediate basic steps that are taking place are roughly:
632 :     <ol>
633 :     <li> Attempt to formulate a growing list of abstract machines that correspond
634 :     to the many specific machines that implement the same goal. These abstract machines (subsystems)
635 :     represent the basic units that make up life forms.
636 :    
637 :     <li> Create protein and RNA families in which the members are all homologous (share a common ancestor),
638 :     remain similar over almost all of the sequence, and all implement a common function.
639 :    
640 :     <li> Build alignments for each protein family, along with phylogenetic trees that represent
641 :     an estimate of the history of how these specific sequences evolved.
642 :    
643 :     <li>Provide a computational framework to support continued maintenance and development of these
644 :     basic data types.
645 :     </ol>
646 :    
647 :     <li> A limited number of groups have progressed to the point where they can create models of
648 :     an organism that display predictive capabilities. There are many forms of modeling. In our view
649 :     it is important that we reach the state where we can routinely model states of the cell, transitions
650 :     between states, and metabolic characteristics of the cell. We believe that it is now possible
651 :     to create fairly comprehensive representations of the metabolic networks of some bacteria.
652 :     In these cases, we have substantial amounts of physiological data, the number of abstract machines
653 :     in the cell is fairly limited, and it is possible to compare the predictions against observed results.
654 :    
655 :    
656 :     </ol>
657 :    
658 :    
659 :     <br><br>
660 :     We do not actually know what machines are present in a cell. We are in the midst of a grand
661 :     effort to clarify which are there and what they do. Reaching a point where we have a near
662 :     complete overview of the basic inventory is arguably the highest priority at this point (we ignore
663 :     the medical revolution and numerous other wonderful advances, but...).
664 :    
665 :     The formulation of subsystems as abstract machines
666 :     in which each row of the subsystem describes a specific cellular machine that is believed to be present,
667 :     represents a way to maintain a collection of estimates or assertions.
668 :     <p>
669 :     A <b>protein family</b> is defined to be a set of proteins that implement the same functional roles and
670 :     are similar over the entire lengths of the proteins.
671 :     <p>
672 :     We seek a situation in which each protein occurs in one or more subsystems and in a single protein family.
673 :     The computational tasks imposed by such a goal are obvious:
674 :     <ul>
675 :     <li>We need to construct databases that implement at least the following entities:
676 :     <ol>
677 :     <li>cells (i.e., each cell must have an ID and a set of attributes),
678 :     <li>genomes,
679 :     <li>genes,
680 :     <li>proteins,
681 :     <li>functional roles,
682 :     <li>subsystems, and
683 :     <li>protein families.
684 :     </ol>
685 :     <li> We need to add support for developing clues to function by integrating data
686 :     from sources like proximity within the genome, fusions, etc.
687 :     <li>We need to support a framework for the development of populated subsystems.
688 :     <li>We need to construct decision procedures for membership in protein families. Some
689 :     of these procedures will be quite complex, although the majority of cases can be
690 :     handled by fairly general procedures.
691 :     </ul>
692 :    
693 :    
694 :     <h1> The Role of Abstraction in Setting the Stage for Software Development and Modeling</h1>
