1 : overbeek 1.1 <div align=center>
2 :     <h1>The Role of Bioinformatics in Interpreting Genomes of Unicellular Organisms:</h1>
3 :     <br>
4 :     <h1>An Abstract View</h1>
5 :     <h2>by Ross Overbeek</h2>
6 :     </div>
7 :    
8 :     <h2>What Is a Cell?</h2>
9 :    
10 :     A <b>cell</b> is a bag (i.e., a volume enclosed by a membrane) that contains three types of things: compounds, cellular machines, and a genome.
11 :     <p>
12 :     By the term <b>compound</b> I refer to the normal notion of chemical compound.
13 :     <p>
14 :    
15 :     A <b>cellular machine</b> is a set of proteins that together perform a function. This function is often to
16 :     transform a set of compounds into another set. Some types of machines (transport machines)
17 :     are used to move compounds into
18 :     or out of the cell.
19 :     <p>
20 :    
21 :     A <b>protein</b> is a string of amino acids (i.e., a string in the 20-character alphabet {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}).
22 :     <p>
23 :    
24 :     A <b>genome</b> is a string of DNA bases (i.e., a string in the 4-character alphabet {A,C,G,T}).
25 :     <p>
26 :    
27 :     A <b>gene</b> is a region in the genome that describes how to build a
28 :     protein. The description is a sequence of 3-character codons. Each
29 :     codon either encodes a single amino acid or is a stop codon.
30 :     There are three stop codons: {TAA,TAG,TGA}. The genetic code is the
31 :     table of correspondences between codons and amino acids:
32 :     <br><br>
33 :     <table border>
34 :     <tr><th>Amino Acid</th><th>Codons</th></tr>
35 :     <tr><td>A</td> <td>GCT, GCC, GCA, GCG </td></tr>
36 :     <tr><td>C</td> <td>TGT, TGC</td></tr>
37 :     <tr><td>D</td> <td>GAT, GAC</td></tr>
38 :     <tr><td>E</td> <td>GAA, GAG</td></tr>
39 :     <tr><td>F</td> <td>TTT, TTC</td></tr>
40 :     <tr><td>G</td> <td>GGT, GGC, GGA, GGG</td></tr>
41 :     <tr><td>H</td> <td>CAT, CAC</td></tr>
42 :     <tr><td>I</td> <td>ATT, ATC, ATA</td></tr>
43 :     <tr><td>K</td> <td>AAA, AAG</td></tr>
44 :     <tr><td>L</td> <td>TTA, TTG, CTT, CTC, CTA, CTG</td></tr>
45 :     <tr><td>M</td> <td>ATG</td></tr>
46 :     <tr><td>N</td> <td>AAT, AAC</td></tr>
47 :     <tr><td>P</td> <td>CCT, CCC, CCA, CCG</td></tr>
48 :     <tr><td>Q</td> <td>CAA, CAG</td></tr>
49 :     <tr><td>R</td> <td>CGT, CGC, CGA, CGG, AGA, AGG</td></tr>
50 :     <tr><td>S</td> <td>TCT, TCC, TCA, TCG, AGT, AGC</td></tr>
51 :     <tr><td>T</td> <td>ACT, ACC, ACA, ACG</td></tr>
52 :     <tr><td>V</td> <td>GTT, GTC, GTA, GTG</td></tr>
53 :     <tr><td>W</td> <td>TGG</td></tr>
54 :     <tr><td>Y</td> <td>TAT, TAC</td></tr>
55 :     <tr><td>*</td> <td>TAG, TGA, TAA [Stop codons]</td></tr>
56 :     </table>
57 :     <br><br>
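<p>
As a minimal sketch, the table above can be turned into a lookup that translates a gene into a protein. The names <i>AMINO_TO_CODONS</i>, <i>GENETIC_CODE</i>, and <i>translate</i> are invented for this illustration, and the sketch assumes the gene is given on the forward strand, starts at its first codon, and contains only the characters A, C, G, and T.

```python
# Genetic code from the table above: amino acid -> codons ('*' marks stop).
AMINO_TO_CODONS = {
    "A": "GCT GCC GCA GCG", "C": "TGT TGC", "D": "GAT GAC", "E": "GAA GAG",
    "F": "TTT TTC", "G": "GGT GGC GGA GGG", "H": "CAT CAC",
    "I": "ATT ATC ATA", "K": "AAA AAG", "L": "TTA TTG CTT CTC CTA CTG",
    "M": "ATG", "N": "AAT AAC", "P": "CCT CCC CCA CCG", "Q": "CAA CAG",
    "R": "CGT CGC CGA CGG AGA AGG", "S": "TCT TCC TCA TCG AGT AGC",
    "T": "ACT ACC ACA ACG", "V": "GTT GTC GTA GTG", "W": "TGG",
    "Y": "TAT TAC", "*": "TAG TGA TAA",
}

# Invert the table: codon -> amino acid.
GENETIC_CODE = {codon: aa
                for aa, codons in AMINO_TO_CODONS.items()
                for codon in codons.split()}


def translate(gene):
    """Translate a DNA string codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(gene) - 2, 3):
        aa = GENETIC_CODE[gene[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(len(GENETIC_CODE))          # 64 codons in total
print(translate("ATGAAACTGTAA"))  # MKL
```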
58 :     <hr>
59 :     This minimal notion of a cell is enough to explain some of the central
60 :     problems in bioinformatics:
61 :    
62 :     <h3>Identify the genes within a genome</h3>
63 :    
64 :     This problem simply involves taking a genome (a string of DNA) and locating
65 :     the set of genes it contains. Does the existence of hundreds of genomes (genomes
66 :     with at least some estimate of where the genes occur) affect how you might do this?
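<p>
To make the problem concrete, here is a deliberately naive sketch that reports every stretch beginning with ATG and ending at the first in-frame stop codon. It is not how production gene finders work: it scans only the forward strand, and treating ATG as the only start signal is a simplifying assumption of this example. The function name and the <i>min_codons</i> parameter are invented here.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(genome, min_codons=3):
    """Return (start, end) spans that begin with ATG and end at the first
    in-frame stop codon. `end` is exclusive and includes the stop codon.
    Only the three forward reading frames are scanned."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(genome) - 2, 3):
            codon = genome[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # open a candidate gene
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                    # close it and keep scanning
    return sorted(orfs)


print(find_orfs("CCATGAAACTGTAAGG"))  # [(2, 14)]
```

A candidate found this way is only a hypothesis. Additional evidence, such as similarity to genes already located in other genomes, is needed to decide which candidates are real genes.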
67 :    
68 :     <h3>Given two proteins, "align" them in a way that minimizes some edit function.</h3>
69 :    
70 :     For example:
71 :     <br>
72 :     <br>
73 :     <pre>
74 :    
75 :     seq1 -------------------MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLT
76 :     seq2 MKIRVICGAPTPKPFIKIPMKYYSTNKQAPLASLEEAVVKGLASDKGLFMPMTIKPLPQE
77 :     ** *. :.: .*: :**.:**..::***:* : :.
78 :    
79 :     seq1 EIDEMLKLDFVTRSAKILSAFIGDEIPQEILEERVRAAFAFPAPVANVESDVGCLELFHG
80 :     seq2 FYDEIENLSFREIAYRVADAFFGEDVPAETLKEIVYDTLNFDVPLVPVKENIYSLELFHG
81 :     **: :*.* : :: .**:*:::* * *:* * :: * .*:. *:.:: .******
82 :    
83 :     seq1 PTLAFKDFGGRFMAQMLTHIA---GDKPVTILTATSGDTGAAVAHAFYGLPNVKVVILYP
84 :     seq2 PTLAFKDVGGRFMARLLGYFIRKEGRKQVNVLVATSGDTGSAVANGFLGVEGIHVYVLYP
85 :     *******.******::* :: * * *.:*.*******:***:.* *: .::* :***
86 :    
87 :     seq1 RGKISPLQEKLFCTLGGNIETVAIDGDFDACQALVKQAFDDEELKVALGLNSANSINISR
88 :     seq2 KGKVSEIQEKQFTTLGRNITALEVDGTFDDCQALVKAAFMDQELNEQLLLTSANSINVAR
89 :     :**:* :*** * *** ** :: :** ** ****** ** *:**: * *.******::*
90 :    
91 :     seq1 LLAQICYYFEAVAQLPQETRNQ-LVVSVPSGNFGDLTAGLLAKSLGLPVKRFIAATNVND
92 :     seq2 FLPQAFYYFYAYAQLKKAGRAENVVICVPSGNFGNITAGLFGKKMGLPVRRFIAANNKND
93 :     :*.* *** * *** : * : :*:.*******::****:.*.:****:*****.* **
94 :     </pre>
95 :    
96 :     This shows an alignment of two proteins (called <i>seq1</i> and <i>seq2</i>).
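<p>
One simple choice of edit function charges one unit for each mismatched pair and one unit for each gap character. The sketch below finds an alignment that minimizes that cost by dynamic programming. The unit costs are an assumption made for illustration; practical protein aligners score substitutions with amino-acid-specific tables and treat gaps more carefully. The function name <i>align</i> is invented here.

```python
def align(a, b, mismatch=1, gap=1):
    """Globally align strings a and b, minimizing mismatch and gap costs.

    Returns (cost, aligned_a, aligned_b), where '-' marks a gap."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cost of aligning a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            cost[i][j] = min(sub, cost[i - 1][j] + gap, cost[i][j - 1] + gap)

    # Trace back from the corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = 0 if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + pair:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return cost[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


print(align("MKLY", "MKY"))  # (1, 'MKLY', 'MK-Y')
```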
97 :    
98 :     <h3> Given a set of sequences, align them in a way that minimizes some edit function.</h3>
99 :    
100 :     Here is an example of a multiple sequence alignment:
101 :     <br>
102 :     <br>
103 :     <pre>
104 :     CLUSTAL W (1.83) multiple sequence alignment
105 :    
106 :    
107 :     seq3 -------------------MRYISTRGQAPALNFEDVLLAGLASDGGLYVPENLPRFTLE
108 :     seq4 -------------------MRYISTRGSAPTLSFEEVLLTGLASDGGLYVPESLPSFTSA
109 :     seq5 -------------------MNYISTRGAIAPIGFKDAVMMGLATDGGLLLPETIPALGRN
110 :     seq1 -------------------MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLT
111 :     seq2 MKIRVICGAPTPKPFIKIPMKYYSTNKQAPLASLEEAVVKGLASDKGLFMPMTIKPLPQE
112 :     *. . . .: :.: **..: ** .* : :
113 :    
114 :     seq3 EIASWVGLPYHELAFRVMRPFVAGSIADADFKKILEETYGVFAHDAVAPLRQLNGNEWVL
115 :     seq4 ELEAMASLDYPSLAHRILLPFVEEAFTGEELREIIDDTYAVFRHSAVAPLVQLDHNQWVL
116 :     seq5 TLESWQSLSYQDLAFNVIS-LFADDIPAQDLKDLIDRSYATFSHPEITPVVEKDG-VYIL
117 :     seq1 EIDEMLKLDFVTRSAKILSAFIGDEIPQEILEERVRAAFAFP-----APVANVESDVGCL
118 :     seq2 FYDEIENLSFREIAYRVADAFFGEDVPAETLKEIVYDTLNFD-----VPLVPVKENIYSL
119 :     * : : .: :. .. :.. : : .*: . *
120 :    
121 :     seq3 ELFHGPTLAFKDFALQLLGRLLDHVLAKRGER-VVIMGATSGDTGSAAIEGCRRCDNVDI
122 :     seq4 ELFHGPTLAFKDFALQLLGRLLDAILKRRGEK-VVIMGATSGDTGSAAIAGCERCENIDI
123 :     seq5 ELFHGPTLAFKDVALQLLGNLFEYLLKERGEK-MNIVGATSGDTGSAAIYGVRGKDKINI
124 :     seq1 ELFHGPTLAFKDFGGRFMAQMLTHIA---GDKPVTILTATSGDTGAAVAHAFYGLPNVKV
125 :     seq2 ELFHGPTLAFKDVGGRFMARLLGYFIRKEGRKQVNVLVATSGDTGSAVANGFLGVEGIHV
126 :     ************.. :::..:: . * : : :: *******:*. . :.:
127 :    
128 :     seq3 FIMHPHNRVSEVQRRQMTTILGDNIHNIAIEGNFDDCQEMVKASFADQGFLK-GTRLVAV
129 :     seq4 FILHPHGRVSEVQRRQMTTLSAPTIHNLAIEGNFDDCQAMVKASFRDQSFLPDGRRLVAV
130 :     seq5 FILHPHGKTSPVQALQMTTVLDPNVHNIAARGTFDDCQNIVKSLFSDLPFKE-KYSLGAV
131 :     seq1 VILYPRGKISPLQEKLFCTLGG-NIETVAIDGDFDACQALVKQAFDDEELKV-ALGLNSA
132 :     seq2 YVLYPKGKVSEIQEKQFTTLGR-NITALEVDGTFDDCQALVKAAFMDQELNE-QLLLTSA
133 :     :::*:.: * :* : *: .: : * ** ** :** * * : * :.
134 :    
135 :     seq3 NSINWARIMAQIVYYFHAALQLG-APH-RSVAFSVPTGNFGDIFAGYLARNMGLPVSQLI
136 :     seq4 NSINWARIMAQIVYYFYAGLRLG-APH-RAAAYSVPTGNFGDIFAGYLASKMGLPVAQLM
137 :     seq5 NSINWARVLAQVVYYFYAYFRVA-ALFGQEVVFSVPTGNFGDIFAGYVAKRMGLPIRRLI
138 :     seq1 NSINISRLLAQICYYFEAVAQLPQETRNQ-LVVSVPSGNFGDLTAGLLAKSLGLPVKRFI
139 :     seq2 NSINVARFLPQAFYYFYAYAQLKKAGRAENVVICVPSGNFGNITAGLFGKKMGLPVRRFI
140 :     **** :*.:.* *** * :: . . .**:****:: ** .. :***: :::
141 :    
142 :     seq3 VATNRNDILHRFMSGNRYDKDTLHPSLSPSMDIMVSSNFERLLFDLHGRNGKAVAELLDA
143 :     seq4 IATNRNDVLHRLLSTGDYARQTLEHTLSPSMDISVSSNFERLMFDLYERDGAAIASLMAA
144 :     seq5 LATNENNILSRFINGGDYSLGDVVATVSPSMDIQLASNFERYVYYLFGENPARVREAFAA
145 :     seq1 AATNVNDTVPRFLHDGQWSPKATQATLSNAMDVSQPNNWPR-VEELFR------------
146 :     seq2 AANNKNDIFYQYLQTGQYNPRPSVATIANAMDVGDPSNFAR-VLDLYGGS----------
147 :     *.* *: . : : . : ::: :**: ..*: * : *.
148 :    
149 :     seq3 FKASGKLSVEDQRWTEARKLFDSLAVSDEQTCETIAEVYRSCGELLDPHTAIGVRAAREC
150 :     seq4 FDD-GDITLSDAAMEKARQLFASHRVDDAQTLACIADVWGRTEYLLDPHSAIGYAAATQP
151 :     seq5 LPTKGRIDFTEAEMEKVRDEFLSRSVNEDETIATIAAFHRETGYILDPHTAVGVKAALEL
152 :     seq1 -------------RKIWQLKELGYAAVDDETTQQTMRELKELGYTSEPHAAVAYRALRDQ
153 :     seq2 -------------HAAIAAEISGTTYTDEQIRESVKACWQQTGYLLDPHGACGYRALEEG
154 :     . : : :** * . * :
155 :    
156 :     seq3 RRSLSVPMVTLGTAHPVKFPEAVEKAGIGQAPALPAHLADLFEREERCTVLPNELAKVQA
157 :     seq4 GANTQTPWVTLATAHPAKFPDAIKASAVGTTAQLPVHLADLFERSEHFDVLPNDIAAVQR
158 :     seq5 VQDG-TPAVCLATAHPAKFAEAVVR-AVGFEPSRPTSLEGIEALPSRCDVLDADRDAIKA
159 :     seq1 LNPG-EYGLFLGTAHPAKFKESVEA-ILGETLDLPKELAERADLPLLSHNLPADFAALRK
160 :     seq2 LQPG-ETGVFLETAHPAKFLQTVES-IIGTEVEIPAKLRAFMKGEKKSLPMTKEFADFKS
161 :     : * ****.** ::: :* * * : : .:
162 :    
163 :     seq3 FVSQHGNRGKPL
164 :     seq4 FMSGHLGA----
165 :     seq5 FIEKKAL-----
166 :     seq1 LMMNHQ------
167 :     seq2 YLLGK-------
168 :     : :
169 :     </pre>
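<p>
For more than two sequences, one common way to define the cost of an alignment is to add up the pairwise costs of every pair of rows. The sketch below only scores an alignment that is already given; it does not construct one. The unit costs and the function name are assumptions of this example, and the test rows are the six columns that begin at the shared M in the first block above.

```python
from itertools import combinations


def sum_of_pairs_cost(rows, mismatch=1, gap=1):
    """Score a multiple alignment by summing, column by column, the cost of
    every pair of rows. Identical characters (including two gaps) cost 0."""
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for column in zip(*rows):
        for x, y in combinations(column, 2):
            if x == y:
                continue
            total += gap if "-" in (x, y) else mismatch
    return total


# seq3, seq4, seq5, seq1, seq2 (six columns starting at the shared M)
block = ["MRYIST", "MRYIST", "MNYIST", "MKLYNL", "MKYYST"]
print(sum_of_pairs_cost(block))  # 26
```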
170 :    
171 :     <h3> Given a multiple sequence alignment, determine the most likely evolutionary history of the sequences (i.e., construct a phylogenetic tree).</h3>
172 :    
173 :     Here is one reasonable tree for the last 5 sequences. Note that we now have alignments that
174 :     contain thousands of sequences, and even displaying such trees is nontrivial.
175 :     <pre>
176 :                        ,--------------------------------------------------- seq1
177 :                        |
178 :                        |
179 :     ,------------------|
180 :     |                  |
181 :     |                  |
182 :     |                  `---------------------------------------------- seq2
183 :     |
184 :     |
185 :     |
186 :     |
187 :     |
188 :     |             ,-------------------------------- seq3
189 :     |             |
190 :     |             |
191 :     |-------------|
192 :     |             |
193 :     |             |
194 :     |             `------------------------------ seq4
195 :     |
196 :     |
197 :     `---------------------------------------------- seq5
198 :     </pre>
199 :    
200 :     This is an <i>unrooted tree</i>, since looking only at extant
201 :     sequences gives us no idea of where the root should lie.
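<p>
Trees like this one are often exchanged as text in Newick notation, where nested parentheses express the grouping. The sketch below encodes the topology of the tree above (seq1 with seq2, seq3 with seq4, and seq5 on its own branch) as nested tuples. Branch lengths are left out, and the names <i>TREE</i>, <i>to_newick</i>, and <i>leaves</i> are invented for this example.

```python
# The unrooted tree as nested tuples: three subtrees meet at the central node.
TREE = (("seq1", "seq2"), ("seq3", "seq4"), "seq5")


def to_newick(node):
    """Render a nested-tuple tree in Newick notation (topology only)."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"


def leaves(node):
    """List the sequence names below a node."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]


print(to_newick(TREE) + ";")  # ((seq1,seq2),(seq3,seq4),seq5);
print(leaves(TREE[0]))        # ['seq1', 'seq2']
```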
202 :    
203 :     <h2>Some Random Facts that You Should Absorb</h2>
204 :    
205 :     Most genomes of bacteria contain between 400,000 and 12,000,000 characters.
206 :     Normally, the genes in a genome
207 :     cover about 90% of the genome.
208 :     Normally, there is about one gene per 1000 characters in a bacterial genome.
209 :     <p>
210 :     So,
211 :     <ul>
212 :     <li> What is the length of the average protein sequence?
213 :     <li>How many genes do these
214 :     genomes have?
215 :     <li>What is the average length of a gene?
216 :     </ul>
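<p>
A back-of-the-envelope calculation from the three facts above gives rough answers. This is only a sketch: it treats the quoted figures as exact and assumes that genes do not overlap. The variable names are invented here.

```python
GENOME_SIZES = (400_000, 12_000_000)  # smallest and largest sizes quoted above
CHARS_PER_GENE = 1000                 # about one gene per 1000 characters
CODING_PERCENT = 90                   # genes cover about 90% of the genome

# Number of genes: one per 1000 characters of genome.
gene_counts = tuple(size // CHARS_PER_GENE for size in GENOME_SIZES)

# Average gene length: 90% of each 1000-character stretch is gene.
avg_gene_length = CHARS_PER_GENE * CODING_PERCENT // 100

# Average protein length: one amino acid per 3-character codon.
avg_protein_length = avg_gene_length // 3

print(gene_counts)         # (400, 12000)
print(avg_gene_length)     # 900
print(avg_protein_length)  # 300
```

Since the last codon of a gene is a stop codon, the protein itself is one amino acid shorter than the number of codons, which does not change the rough figure of about 300.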
217 :     <br>
218 :     It is worth spending just a short bit of time thinking about what types of cellular
219 :     machines must exist. Here are a few thoughts to start with:
220 :     <ul>
221 :     <li>
222 :     There must be one or more machines that support replication of the cell. You would
223 :     need something to copy the genome, and you would need something that could build the DNA
224 :     bases that represent the characters (i.e., you will need machines to build the molecules
225 :     corresponding to each of the four characters in the alphabet of DNA bases).
226 :     <li>
227 :     As we mentioned, you have transport machines that take things into and out of the cell. Many
228 :     cells can import food in the form of sugar molecules. For example, many cells can import
229 :     <i>glucose</i>, a six-carbon compound. As the compound gets broken down into smaller compounds,
230 :     energy is salvaged from the broken bonds to power the machines in the cell. The smaller compounds
231 :     are used as building blocks for other needs.
232 :     <li>
233 :     There must be one or more machines involved in building proteins from the descriptions in the genes.
234 :     In particular, we will need a machine to build each of the amino acids (unless the cell can import some
235 :     of them).
236 :     <li>
237 :     There must be mechanisms for sensing what is going on in the environment and allowing the cell
238 :     to react to it. For example, many cells can "swim" towards food.
239 :     </ul>
240 :     Those were just a few examples. For any cell, we have many, many machines, and we still
241 :     do not even understand what some of them do.
242 :     <p>
243 :     About 50-60% of the genes occur within 5000 characters of another gene such that
244 :     the two genes encode proteins that are part of the same cellular machine. If you
245 :     had a genome in which the genes were identified, but the correspondence between the encoded
246 :     proteins and cellular machines was completely unknown, what could you learn using this fact?
247 :     Is the situation significantly different if you have 1000 genomes (let us say that
248 :     you know where the genes occur, but the correspondence between the proteins and cellular machines
249 :     is completely unknown in each case)?
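<p>
One way to start exploring these questions is to list the pairs of genes that lie close together, and then to count how often the same pair turns up close together in many genomes. The sketch below assumes that each gene is given as a (label, start position) pair, and that genes from different genomes that encode similar proteins have already been given the same label. That labeling step is an assumption of this example, and all names are invented.

```python
from collections import Counter


def close_pairs(genes, max_gap=5000):
    """Return pairs of gene labels whose start positions lie within max_gap
    characters of each other. `genes` is a list of (label, start) tuples."""
    ordered = sorted(genes, key=lambda g: g[1])
    pairs = []
    for i, (label_a, start_a) in enumerate(ordered):
        for label_b, start_b in ordered[i + 1:]:
            if start_b - start_a > max_gap:
                break  # every later gene is even farther away
            pairs.append((label_a, label_b))
    return pairs


def conserved_pairs(genomes, max_gap=5000):
    """Count, over many genomes, how often two labels occur close together."""
    counts = Counter()
    for genes in genomes:
        seen = {frozenset(p) for p in close_pairs(genes, max_gap) if p[0] != p[1]}
        counts.update(seen)
    return counts


genomes = [
    [("a", 0), ("b", 1200), ("c", 9000)],
    [("a", 100), ("b", 3000), ("c", 50000)],
]
print(conserved_pairs(genomes)[frozenset({"a", "b"})])  # 2
```

A pair that is close in one genome could be close by chance. A pair that stays close in many genomes is a much stronger hint that the two proteins belong to the same machine.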
250 :     <p>
251 :     Occasionally, proteins that are usually distinct in most cells are fused into a single protein in
252 :     a few cells. In these cases, the fused gene is (by definition) part of a single machine, and
253 :     in most cells in which the proteins are not fused, the two distinct proteins are separate components
254 :     of a single machine. How would you go about locating fused genes, and what could you learn from them?
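<p>
One possible approach, sketched below, is to look for a protein in which two different proteins match separate, non-overlapping regions. The sketch assumes that similarity matches have already been computed by some search tool and are given as (protein, start, end, partner) tuples; how the matches are obtained is outside the example, and the function name is invented.

```python
from itertools import combinations


def fusion_candidates(hits):
    """Find proteins covered by two different partner proteins in
    non-overlapping regions.

    `hits` is a list of (protein, start, end, partner) tuples, meaning that
    `partner` is similar to positions start..end of `protein`."""
    by_protein = {}
    for protein, start, end, partner in hits:
        by_protein.setdefault(protein, []).append((start, end, partner))
    found = []
    for protein, regions in by_protein.items():
        # Sorting by start position guarantees s1 <= s2 in every pair.
        for (s1, e1, p1), (s2, e2, p2) in combinations(sorted(regions), 2):
            if p1 != p2 and e1 < s2:  # distinct partners, disjoint regions
                found.append((protein, p1, p2))
    return found


hits = [("X", 1, 200, "A"), ("X", 210, 400, "B"), ("Y", 1, 300, "C")]
print(fusion_candidates(hits))  # [('X', 'A', 'B')]
```

Each candidate suggests that the two partner proteins (A and B in the example) are components of one machine, even in cells where they are not fused.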
255 :     <p>
256 :     Biologists have figured out the roles of about 50% of the genes. That is, they can
257 :     place the gene in a cellular machine, they know what the machine does, and they know
258 :     the specific role of the gene in sustaining the functionality of the machine.
259 :     <br><br>
260 :    
261 :     <h2>Imposing a Structure on Characterizing the Inventory</h2>
262 :    
263 :     One central goal of bioinformatics is to support an accurate characterization of the cellular
264 :     machinery for each cell. It is of major importance to biologists that we be able to support
265 :     comparative analysis of cells. Perhaps the most important aspect of understanding cells relates to
266 :     their origin in an evolutionary process. Cells have a long evolutionary history dating back billions of
267 :     years. The machines we see in cells today arose in the past, so we expect to see many current cells
268 :     using machinery that resembles what turns up in other cells. When we compare machines from different
269 :     cells they often look remarkably similar. On the other hand, those that had a common origin in a cell that existed
270 :     billions of years in the past may now have versions that are not very similar. Modifications, optimizations,
271 :     and insignificant alterations all combine to explore the space of operational possibilities for
272 :     each type of machine. Hence, we need a framework for studying similarities and differences in the
273 :     cellular machines and the proteins that implement them.
274 :     <p>
275 :    
276 :     Here is a short formulation of one way to do this:
277 :     <br><br>
278 :     <ul>
279 :     <li>A <b>subsystem</b> (i.e., an abstract cellular machine) is a set of functional roles.
280 :     <li>Each protein implements one or more functional roles. The set of functional roles
281 :     implemented by the protein is called the <b>function of the protein</b>. The function of a multifunctional
282 :     protein that implements {functional-role-1,functional-role-2} is normally written as
283 :     <i>functional-role-1 / functional-role-2</i>.
284 :     <br><br>
285 :     <li>A <b>populated subsystem</b> is a subsystem with an attached spreadsheet. Each column
286 :     in the spreadsheet corresponds to a functional role in the subsystem, and each row corresponds to
287 :     a specific genome. Each cell in the spreadsheet contains the genes from the corresponding genome
288 :     that implement the designated functional role (there may be 0 or more such genes).
289 :     </ul>
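<p>
The three definitions above can be pictured with ordinary dictionaries and lists. This is only an illustration of the shape of the data: the role, genome, and gene names are invented, and nothing here describes how any particular system stores subsystems.

```python
# A subsystem is a set of functional roles (kept as a list to fix column order).
subsystem_roles = ["role-1", "role-2", "role-3"]

# The spreadsheet: one row per genome, one column per role, and each cell
# holds the genes (0 or more) that implement that role in that genome.
spreadsheet = {
    "genome-A": {"role-1": ["geneA1"], "role-2": ["geneA7", "geneA9"], "role-3": []},
    "genome-B": {"role-1": ["geneB4"], "role-2": ["geneB4"], "role-3": ["geneB2"]},
}


def missing_roles(genome):
    """Roles of the subsystem for which this genome has an empty cell."""
    return [role for role in subsystem_roles if not spreadsheet[genome].get(role)]


def function_of(gene, genome):
    """Function of the encoded protein: its roles joined with ' / '."""
    roles = [r for r in subsystem_roles if gene in spreadsheet[genome].get(r, [])]
    return " / ".join(roles)


print(missing_roles("genome-A"))          # ['role-3']
print(function_of("geneB4", "genome-B"))  # role-1 / role-2
```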
290 :     <br><br>
291 :     We do not actually know what machines are present in a cell. We are in the midst of a grand
292 :     effort to clarify which are there and what they do. The formulation of subsystems as abstract machines,
293 :     in which each row of a populated subsystem describes a specific cellular machine that is believed to be present,
294 :     represents a way to maintain a collection of estimates or assertions.
295 :     <p>
296 :     A <b>protein family</b> is defined to be a set of proteins that implement the same functional roles and
297 :     are similar over the entire lengths of the proteins.
298 :     <p>
299 :     We seek a situation in which each protein occurs in one or more subsystems and in a single protein family.
300 :     The computational tasks imposed by such a goal are obvious:
301 :     <ul>
302 :     <li>We need to construct databases that implement at least the following entities:
303 :     <ol>
304 :     <li>cells (i.e., each cell must have an ID and a set of attributes),
305 :     <li>genomes,
306 :     <li>genes,
307 :     <li>proteins,
308 :     <li>functional roles,
309 :     <li>subsystems, and
310 :     <li>protein families.
311 :     </ol>
312 :     <li> We need to add support for developing clues to function by integrating data
313 :     from sources like proximity within the genome, fusions, etc.
314 :     <li>We need to support a framework for the development of populated subsystems.
315 :     <li>We need to construct decision procedures for membership in protein families. Some
316 :     of these procedures will be quite complex, although the majority of cases can be
317 :     handled by fairly general procedures.
318 :     </ul>
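<p>
As a rough sketch of what a few of these entities might look like in a program, the classes below record just enough to link a protein to its gene, its functional roles, its protein family, and the subsystems it occurs in. The field names are guesses made for illustration and are not a proposed schema.

```python
from dataclasses import dataclass, field


@dataclass
class Gene:
    gene_id: str
    genome_id: str
    start: int
    end: int


@dataclass
class Protein:
    protein_id: str
    gene_id: str
    sequence: str
    roles: list = field(default_factory=list)  # functional roles it implements
    family_id: str = ""                        # its single protein family


@dataclass
class Subsystem:
    name: str
    roles: list = field(default_factory=list)  # the set of functional roles


def subsystems_of(protein, subsystems):
    """Names of the subsystems that contain at least one of the protein's roles."""
    return [s.name for s in subsystems if set(s.roles) & set(protein.roles)]


protein = Protein("p1", "g1", "MKL", roles=["role-1"], family_id="fam-1")
subsystems = [Subsystem("S1", ["role-1", "role-2"]), Subsystem("S2", ["role-3"])]
print(subsystems_of(protein, subsystems))  # ['S1']
```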
319 :    
320 : overbeek 1.2 <h2>States of the Cell</h2>
321 : overbeek 1.1
322 : overbeek 1.2 The notion of <i>subsystem</i> was introduced as an <i>abstract machine</i> -- that is, as an
323 :     attempt to create a framework for understanding variations within specific cellular machines via
324 :     a form of comparative analysis.
325 :    
326 :     In any specific cell, sets of specific cellular machines are
327 :     switched on and off as units. That is, they are <i>co-regulated</i>. We will call such a set
328 :     of <i>co-regulated cellular machines</i> a <b>regulon</b> (note that a regulon is often a set containing
329 :     a single cellular machine). A <b>state</b> of a cell will be defined
330 :     as the set of regulons that are operational at a point in time. Thus, a state amounts to the set
331 :     of cellular machines that are operational at one instant.
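<p>
In code, these definitions amount to sets of sets. The sketch below uses invented regulon and machine names: a regulon is a set of machines, a state is a set of regulons, and the machines that are operational are the union of the active regulons.

```python
# Regulon and machine names are invented for illustration.
regulons = {
    "regulon-1": {"machine-a", "machine-b"},  # switched on and off as a unit
    "regulon-2": {"machine-c"},               # a regulon with a single machine
    "regulon-3": {"machine-d", "machine-e"},
}


def operational_machines(state):
    """A state is a set of active regulons; return the machines that run in it."""
    machines = set()
    for name in state:
        machines |= regulons[name]
    return machines


state = {"regulon-1", "regulon-2"}
print(sorted(operational_machines(state)))  # ['machine-a', 'machine-b', 'machine-c']
```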
332 : overbeek 1.1 <p>
333 : overbeek 1.2 If we think of a car as a bag of machines that interact to make it function, we might consider there
334 :     to be a huge number of states. There are many very minor "machines" like the arm rest (or the radio, or the night light) that can be on or off. However, we can divide the states of a car into major groupings based on the status
335 :     of some key "machines". For example, "off" (the state in which the engine is turned off and the car is parked) and
336 :     "on" (the engine is running and the car is moving) might be viewed as a crude partitioning of the states into
337 :     two "major states".
338 : overbeek 1.1 <p>
339 : overbeek 1.2 Similarly, I believe that we should think about <i>major states of the cell</i> as being determined by the
340 :     functioning (or not) of a limited set of regulons. The determination of these regulons, the major states,
341 :     and how transitions between them are managed are all now parts of the picture being filled in.
342 :    
343 :    
344 :     <h2>Microarrays</h2>
345 :    
346 :     Microarrays are, for a given genome, two lists of genes that "changed expression levels" between two states of a
347 :     cell. Basically, the first list contains genes that were "active" during the first state, but not the second; and the
348 :     second list contains genes that were "active" in the second but not the first. If a cellular
349 :     machine utilizes protein <i>X</i>, and <i>X</i> is in the first list, and if <i>X</i> is used in
350 :     only one cellular machine, then it would be reasonable to infer that the machine was
351 :     active in the first state, but not the second. If one knew the regulons for a specific cell, it would go
352 :     a long way toward supporting the extraction of insights from these microarrays. On the other hand, if one had many,
353 :     many microarrays, and if the specific cellular machines for the cell were known, then one could make
354 :     substantial progress in uncovering the exact composition of the regulons that make up the cell.
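<p>
The inference described above can be sketched in a few lines. The example assumes that the first list and the gene content of each machine are already known, and it identifies each protein with the gene that encodes it. All names are invented.

```python
def machines_active_only_in_first(first_list, machines):
    """Infer machines that were active in the first state but not the second.

    `first_list` holds the genes that were active only in the first state.
    `machines` maps a machine name to the set of genes whose proteins it uses.
    A gene counts as evidence only if exactly one machine uses it."""
    inferred = set()
    for gene in first_list:
        users = [name for name, genes in machines.items() if gene in genes]
        if len(users) == 1:
            inferred.add(users[0])
    return inferred


machines = {"m1": {"X", "Y"}, "m2": {"Z", "W"}, "m3": {"W"}}
print(machines_active_only_in_first(["X", "W"], machines))  # {'m1'}
```

Gene W is shared by two machines, so it says nothing on its own about which of them changed; only X supports a conclusion.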
355 : overbeek 1.1
