
1 : overbeek 1.1 <div align=center>
2 :     <h1>The Role of Bioinformatics in Interpreting Genomes of Unicellular Organisms:</h1>
3 :     <br>
4 :     <h1>An Abstract View</h1>
5 :     <h2>by Ross Overbeek</h2>
6 :     </div>
7 :    
8 :     <h2>What Is a Cell?</h2>
9 :    
10 :     A <b>cell</b> is a bag (i.e., a volume enclosed by a membrane) that contains three types of things: compounds, cellular machines, and a genome.
11 :     <p>
12 : overbeek 1.3 By the term <b>compound</b> we refer to the normal notion of chemical compound.
13 : overbeek 1.1 <p>
14 :    
15 : overbeek 1.3 A <b>cellular machine</b> is a set of proteins that together perform a function. Unless otherwise noted,
16 :     when we use the term <i>machine</i> we will always be speaking of a cellular machine.
17 :     Many machines
18 :     transform one set of compounds into another set. Some machines (transport machines)
19 : overbeek 1.1 are used to move compounds into
20 : overbeek 1.3 or out of the cell. Later we will try to convey a more comprehensive notion of what functions are implemented
21 :     by machines that we understand.
22 : overbeek 1.1 <p>
23 :    
24 :     A <b>protein</b> is a string of amino acids (i.e., a string in the 20-character alphabet {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}).
25 :     <p>
26 :    
27 :     A <b>genome</b> is a string of DNA bases (i.e., a string in the 4-character alphabet {A,C,G,T}).
28 :     <p>
29 :    
30 :     A <b>gene</b> is a region in the genome that describes how to build a
31 :     protein. The description is a sequence of 3-character codons. Each
32 :     codon corresponds to either a single amino acid or a stop codon.
33 :     There are three stop codons: {TAA,TAG,TGA}. The genetic code is the
34 :     table of correspondences between codons and amino acids:
35 :     <br><br>
36 :     <table border>
37 :     <tr><th>Amino Acid</th><th>Codons</th></tr>
38 :     <tr><td>A</td> <td>GCT, GCC, GCA, GCG </td></tr>
39 :     <tr><td>C</td> <td>TGT, TGC</td></tr>
40 :     <tr><td>D</td> <td>GAT, GAC</td></tr>
41 :     <tr><td>E</td> <td>GAA, GAG</td></tr>
42 :     <tr><td>F</td> <td>TTT, TTC</td></tr>
43 :     <tr><td>G</td> <td>GGT, GGC, GGA, GGG</td></tr>
44 :     <tr><td>H</td> <td>CAT, CAC</td></tr>
45 :     <tr><td>I</td> <td>ATT, ATC, ATA</td></tr>
46 :     <tr><td>K</td> <td>AAA, AAG</td></tr>
47 :     <tr><td>L</td> <td>TTA, TTG, CTT, CTC, CTA, CTG</td></tr>
48 :     <tr><td>M</td> <td>ATG</td></tr>
49 :     <tr><td>N</td> <td>AAT, AAC</td></tr>
50 :     <tr><td>P</td> <td>CCT, CCC, CCA, CCG</td></tr>
51 :     <tr><td>Q</td> <td>CAA, CAG</td></tr>
52 :     <tr><td>R</td> <td>CGT, CGC, CGA, CGG, AGA, AGG</td></tr>
53 :     <tr><td>S</td> <td>TCT, TCC, TCA, TCG, AGT, AGC</td></tr>
54 :     <tr><td>T</td> <td>ACT, ACC, ACA, ACG</td></tr>
55 :     <tr><td>V</td> <td>GTT, GTC, GTA, GTG</td></tr>
56 :     <tr><td>W</td> <td>TGG</td></tr>
57 :     <tr><td>Y</td> <td>TAT, TAC</td></tr>
58 :     <tr><td>*</td> <td>TAG, TGA, TAA [Stop codons]</td></tr>
59 :     </table>
60 :     <br><br>
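As a minimal sketch of how the table above is used, the following Python fragment inverts it into a codon-to-amino-acid lookup and translates a gene codon by codon until a stop codon is reached. The names GENETIC_CODE, CODON_TO_AA, and translate are illustrative assumptions, not part of any existing tool. The example gene is made of the first nine codons of gene1 from the alignment shown later in this tutorial, followed by its TGA stop codon.

```python
# The genetic code exactly as listed in the table above ("*" marks a stop codon).
GENETIC_CODE = {
    "A": "GCT GCC GCA GCG", "C": "TGT TGC", "D": "GAT GAC", "E": "GAA GAG",
    "F": "TTT TTC", "G": "GGT GGC GGA GGG", "H": "CAT CAC", "I": "ATT ATC ATA",
    "K": "AAA AAG", "L": "TTA TTG CTT CTC CTA CTG", "M": "ATG", "N": "AAT AAC",
    "P": "CCT CCC CCA CCG", "Q": "CAA CAG", "R": "CGT CGC CGA CGG AGA AGG",
    "S": "TCT TCC TCA TCG AGT AGC", "T": "ACT ACC ACA ACG",
    "V": "GTT GTC GTA GTG", "W": "TGG", "Y": "TAT TAC", "*": "TAG TGA TAA",
}

# Invert the table: one entry per codon (64 in total).
CODON_TO_AA = {
    codon: amino_acid
    for amino_acid, codons in GENETIC_CODE.items()
    for codon in codons.split()
}


def translate(gene):
    """Translate a DNA string into a protein string, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(gene) - 2, 3):
        amino_acid = CODON_TO_AA[gene[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCTGATTTATTCGCATTGACCGAATGA"))
```

The sketch assumes the gene is given on the forward strand, starts at its first codon, and contains only the characters A, C, G, and T.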
61 :     <hr>
62 : overbeek 1.3 We will be speaking about organisms that are a single cell. At some point life began on earth.
63 :     The single-celled organisms that we know of replicate, producing copies of themselves whose
64 :     genomes are usually almost identical to that of the parent cell. <b>Evolution</b> is the
65 :     process in which cells replicate with some alterations in their genomes, are subjected to
66 :     <i>selective pressure</i>, and survive or not depending on many somewhat random factors. The makeup of
67 :     cells (i.e., the genomes they contain and the machines that define what they are capable of doing)
68 :     changes gradually (and sometimes not so gradually) as time passes.
69 :     <p>
70 :     The original life forms that existed billions of years ago have evolved into three broad categories of
71 :     life forms. That is, the evolutionary process led to early divisions, and these led to three main
72 :     categories of single-celled organisms. We call these three forms the <b>archaea</b>,
73 :     the <b>bacteria</b>, and the <b>eukaryotes</b>.
74 :     A majority of the organisms for which we have acquired complete genomes are from the bacteria,
75 :     although the
76 :     numbers are rapidly growing for all three domains.
77 :     <p>
78 :     This minimal notion of a cell is enough to explain some of the basic
79 : overbeek 1.1 problems in bioinformatics:
80 :    
81 :     <h3>Identify the genes within a genome</h3>
82 :    
83 : overbeek 1.3 If we are to understand the contents of genomes, we will need to
84 :     locate the genes that occur in each genome. This problem simply involves taking a genome (a
85 :     string of DNA) and locating the set of genes it contains.
86 :     In the case of bacteria and archaea, we know pretty well how to
87 :     locate the genes.
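<p>
To give a feel for the problem, here is a deliberately naive sketch that reports stretches running from an ATG codon to the next in-frame stop codon. It is only an illustration under simplifying assumptions and not the procedure used by real gene callers: it scans the forward strand only, ignores alternative start codons such as GTG, and uses a hypothetical minimum-length cutoff (min_codons) to discard short chance hits. The function name find_orfs is likewise an assumption made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(genome, min_codons=30):
    """Return (start, end) positions, end exclusive, of ATG...stop stretches.

    Only the forward strand is scanned, once in each of the three reading frames.
    """
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(genome) - 2, 3):
            codon = genome[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                      # first ATG opens a candidate gene
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None                     # look for the next candidate
    return orfs


# A toy "genome": two filler characters, a five-codon gene, two filler characters.
toy = "CC" + "ATG" + "GCT" * 3 + "TAA" + "GG"
print(find_orfs(toy, min_codons=3))
```
<p>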
88 :     Once we
89 :     have identified instances from many genomes, it becomes possible to
90 :     recognize the genes in a new genome by just looking for things similar
91 :     to those we already understand. The following problem is at the heart of recognizing when two
92 :     genes are "similar".
93 :    
94 :     <h3>Given two genes, "align" them in a way that minimizes some edit function.</h3>
95 : overbeek 1.1
96 : overbeek 1.3 For example, here is what you see when you align two genes from distinct organisms:
97 : overbeek 1.1
98 :     <pre>
99 :    
100 :    
101 : overbeek 1.3 gene1 ATGGCTGATTTATTCGCATTGACCGAAGAAGCGTTGGCGGGCATGGGCATCGAGTTGGTC
102 :     gene2 ---GTGCAACTGACGGAACTGATAGAAACTACGGTCACGGGGCTCGGCTACGAGCTCGTC
103 :     * * * * * *** *** ** * **** * *** **** * ***
104 :    
105 :     gene1 GATGTCGAACGTGCCGCCTTAGGCTTGTTGCGCGTGACCATAGACCGTGAGGACGGTGTT
106 :     gene2 GATCTCGAGCGCACCGGGCGCGGCATGGTCTGCGTCTACATCGATCAGCCCGCCGGCATC
107 :     *** **** ** *** *** ** * **** *** ** * * *** *
108 :    
109 :     gene1 CGCATCGAAGATTGTGAGCAGGTGTCCCGGCAATTGTCGCGCGTCTACGAGGTCGAGAAC
110 :     gene2 ACGATCGACGATTGCGAGAAGGTCACGCGTCAGCTCCAGCACGTACTGACGGTCGAAAAC
111 :     ***** ***** *** **** * ** ** * ** *** ****** ***
112 :    
113 :     gene1 ATCGATTACAAACGTCTGGAAGTTGGCTCGCCGGGCGTGGATCGCCCCTTGCGCAACGAG
114 :     gene2 ATCGATTACGAACGGCTCGAGGTCTCGTCACCGGGGCTCGACCGGCCGTTGAAGAAGCTG
115 :     ********* **** ** ** ** ** ***** * ** ** ** *** ** *
116 :    
117 :     gene1 GCGGAATTCCGTCGTTTCGCGGGTGAACGTATCGAGATCAAGCTGCGTGAGGCAGTCGAC
118 :     gene2 GCTGACTTCACGCGTTTCGCGGGCAGCGAGGCCGTCATCACCCTGAAAAAGCCGTTGGAC
119 :     ** ** *** *********** ** **** *** ** * * ***
120 :    
121 :     gene1 GGGCGCAAAGTGTTTACCGGCATCCTGCAAGAGGCGGACACGTCTGCTGACGATAAGACG
122 :     gene2 GGGCGCAAGACGTACCGGGGCATTCTGCACGCGCCGAAC------------GGCGAGACG
123 :     ******** ** ***** ***** * * ** ** * *****
124 :    
125 :     gene1 GTGTTCGGTCTCGAATTTGAGGCAAAGAAGGACGATATTCAGGTACTGAGCTTCACGCTC
126 :     gene2 AT---CGGTTTGGAATTTGAGAGGAAGAAGGGCGAGGCGGCCATGCTGGATTTCACGCTG
127 :     * **** * ********* ******* *** * *** ********
128 :    
129 :     gene1 GATGACATCGAGCGCGCCAAGCTGGATCCCGTTCTGGATTTCAAGGGCAAAAAGCGATGA
130 :     gene2 GCGGACATCGACAAGGCCCGCCTGATTCCGCACGTTGACTTTAGGAGCCGCAAACAATGA
131 :     * ******** *** *** *** * ** ** * * ** ** * ****
132 :     </pre>
133 :     <hr>
134 :    
135 :     The sequences are recognizably similar, and in fact implement exactly the same function
136 :     in the two cells. If we align the protein sequences corresponding to these two
137 :     genes, we get
138 :    
139 :     <pre>
140 :     gene1 MADLFALTEEALAGMGIELVDVERAALGLLRVTIDREDGVRIEDCEQVSRQLSRVYEVEN
141 :     gene2 -VQLTELIETTVTGLGYELVDLERTGRGMVCVYIDQPAGITIDDCEKVTRQLQHVLTVEN
142 :     :* * * :::*:* ****:**:. *:: * **: *: *:***:*:***.:* ***
143 :    
144 :     gene1 IDYKRLEVGSPGVDRPLRNEAEFRRFAGERIEIKLREAVDGRKVFTGILQEADTSADDKT
145 :     gene2 IDYERLEVSSPGLDRPLKKLADFTRFAGSEAVITLKKPLDGRKTYRGILHAPNG-----E
146 :     ***:****.***:****:: *:* ****.. *.*::.:****.: ***: .:
147 :    
148 :     gene1 VFGLEFEAKKDDIQVLSFTLDDIERAKLDPVLDFKGKKR
149 :     gene2 TIGLEFERKKGEAAMLDFTLADIDKARLIPHVDFRSRKQ
150 :     .:***** **.: :*.*** **::*:* * :**:.:*:
151 : overbeek 1.1 </pre>
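<p>
The "edit function" mentioned in the heading of this section can be made concrete in many ways. As one hedged illustration, the sketch below charges a fixed cost for each mismatched column and each gap character and finds an alignment of minimal total cost by dynamic programming. The unit costs and the function name align are assumptions chosen for simplicity; the alignments displayed above were produced with more refined scoring schemes.

```python
def align(a, b, mismatch=1, gap=1):
    """Globally align strings a and b, minimizing total mismatch and gap cost.

    Returns (cost, aligned_a, aligned_b), where "-" marks a gap.
    """
    n, m = len(a), len(b)
    # cost[i][j] is the minimal cost of aligning a[:i] with b[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = cost[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            cost[i][j] = min(substitute, cost[i - 1][j] + gap, cost[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (
            0 if a[i - 1] == b[j - 1] else mismatch
        ):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return cost[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


total, row1, row2 = align("ACGT", "AGT")
print(total)
print(row1)
print(row2)
```

The table has one entry per pair of prefixes, so the running time grows with the product of the two sequence lengths. That is acceptable for a pair of genes, but it is one reason aligning thousands of sequences calls for other strategies.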
152 :    
153 : overbeek 1.3 There is a great deal of work relating to recognizing when two sequences are
154 :     similar and whether or not they had a common ancestor. Understanding why
155 :     selective pressure conserves sections of sequences, but not others, will yield
156 :     important clues. Can you reason out why some sections might be conserved, while
157 :     others vary wildly?
158 :     <p>
159 :    
160 :     Comparing sets of sequences that have retained the same function is
161 :     at the heart of understanding cellular machines and the proteins that implement them.
162 :     We find that looking at sets (often with more than two sequences) and aligning them
163 :     is important.
164 :    
165 : overbeek 1.1
166 :     <h3> Given a set of sequences, align them in a way that minimizes some edit function.</h3>
167 :    
168 :     Here is an example of a multiple sequence alignment:
169 :     <br>
170 :     <br>
171 :     <pre>
172 :     CLUSTAL W (1.83) multiple sequence alignment
173 :    
174 :    
175 :     seq3 -------------------MRYISTRGQAPALNFEDVLLAGLASDGGLYVPENLPRFTLE
176 :     seq4 -------------------MRYISTRGSAPTLSFEEVLLTGLASDGGLYVPESLPSFTSA
177 :     seq5 -------------------MNYISTRGAIAPIGFKDAVMMGLATDGGLLLPETIPALGRN
178 :     seq1 -------------------MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLT
179 :     seq2 MKIRVICGAPTPKPFIKIPMKYYSTNKQAPLASLEEAVVKGLASDKGLFMPMTIKPLPQE
180 :     *. . . .: :.: **..: ** .* : :
181 :    
182 :     seq3 EIASWVGLPYHELAFRVMRPFVAGSIADADFKKILEETYGVFAHDAVAPLRQLNGNEWVL
183 :     seq4 ELEAMASLDYPSLAHRILLPFVEEAFTGEELREIIDDTYAVFRHSAVAPLVQLDHNQWVL
184 :     seq5 TLESWQSLSYQDLAFNVIS-LFADDIPAQDLKDLIDRSYATFSHPEITPVVEKDG-VYIL
185 :     seq1 EIDEMLKLDFVTRSAKILSAFIGDEIPQEILEERVRAAFAFP-----APVANVESDVGCL
186 :     seq2 FYDEIENLSFREIAYRVADAFFGEDVPAETLKEIVYDTLNFD-----VPLVPVKENIYSL
187 :     * : : .: :. .. :.. : : .*: . *
188 :    
189 :     seq3 ELFHGPTLAFKDFALQLLGRLLDHVLAKRGER-VVIMGATSGDTGSAAIEGCRRCDNVDI
190 :     seq4 ELFHGPTLAFKDFALQLLGRLLDAILKRRGEK-VVIMGATSGDTGSAAIAGCERCENIDI
191 :     seq5 ELFHGPTLAFKDVALQLLGNLFEYLLKERGEK-MNIVGATSGDTGSAAIYGVRGKDKINI
192 :     seq1 ELFHGPTLAFKDFGGRFMAQMLTHIA---GDKPVTILTATSGDTGAAVAHAFYGLPNVKV
193 :     seq2 ELFHGPTLAFKDVGGRFMARLLGYFIRKEGRKQVNVLVATSGDTGSAVANGFLGVEGIHV
194 :     ************.. :::..:: . * : : :: *******:*. . :.:
195 :    
196 :     seq3 FIMHPHNRVSEVQRRQMTTILGDNIHNIAIEGNFDDCQEMVKASFADQGFLK-GTRLVAV
197 :     seq4 FILHPHGRVSEVQRRQMTTLSAPTIHNLAIEGNFDDCQAMVKASFRDQSFLPDGRRLVAV
198 :     seq5 FILHPHGKTSPVQALQMTTVLDPNVHNIAARGTFDDCQNIVKSLFSDLPFKE-KYSLGAV
199 :     seq1 VILYPRGKISPLQEKLFCTLGG-NIETVAIDGDFDACQALVKQAFDDEELKV-ALGLNSA
200 :     seq2 YVLYPKGKVSEIQEKQFTTLGR-NITALEVDGTFDDCQALVKAAFMDQELNE-QLLLTSA
201 :     :::*:.: * :* : *: .: : * ** ** :** * * : * :.
202 :    
203 :     seq3 NSINWARIMAQIVYYFHAALQLG-APH-RSVAFSVPTGNFGDIFAGYLARNMGLPVSQLI
204 :     seq4 NSINWARIMAQIVYYFYAGLRLG-APH-RAAAYSVPTGNFGDIFAGYLASKMGLPVAQLM
205 :     seq5 NSINWARVLAQVVYYFYAYFRVA-ALFGQEVVFSVPTGNFGDIFAGYVAKRMGLPIRRLI
206 :     seq1 NSINISRLLAQICYYFEAVAQLPQETRNQ-LVVSVPSGNFGDLTAGLLAKSLGLPVKRFI
207 :     seq2 NSINVARFLPQAFYYFYAYAQLKKAGRAENVVICVPSGNFGNITAGLFGKKMGLPVRRFI
208 :     **** :*.:.* *** * :: . . .**:****:: ** .. :***: :::
209 :    
210 :     seq3 VATNRNDILHRFMSGNRYDKDTLHPSLSPSMDIMVSSNFERLLFDLHGRNGKAVAELLDA
211 :     seq4 IATNRNDVLHRLLSTGDYARQTLEHTLSPSMDISVSSNFERLMFDLYERDGAAIASLMAA
212 :     seq5 LATNENNILSRFINGGDYSLGDVVATVSPSMDIQLASNFERYVYYLFGENPARVREAFAA
213 :     seq1 AATNVNDTVPRFLHDGQWSPKATQATLSNAMDVSQPNNWPR-VEELFR------------
214 :     seq2 AANNKNDIFYQYLQTGQYNPRPSVATIANAMDVGDPSNFAR-VLDLYGGS----------
215 :     *.* *: . : : . : ::: :**: ..*: * : *.
216 :    
217 :     seq3 FKASGKLSVEDQRWTEARKLFDSLAVSDEQTCETIAEVYRSCGELLDPHTAIGVRAAREC
218 :     seq4 FDD-GDITLSDAAMEKARQLFASHRVDDAQTLACIADVWGRTEYLLDPHSAIGYAAATQP
219 :     seq5 LPTKGRIDFTEAEMEKVRDEFLSRSVNEDETIATIAAFHRETGYILDPHTAVGVKAALEL
220 :     seq1 -------------RKIWQLKELGYAAVDDETTQQTMRELKELGYTSEPHAAVAYRALRDQ
221 :     seq2 -------------HAAIAAEISGTTYTDEQIRESVKACWQQTGYLLDPHGACGYRALEEG
222 :     . : : :** * . * :
223 :    
224 :     seq3 RRSLSVPMVTLGTAHPVKFPEAVEKAGIGQAPALPAHLADLFEREERCTVLPNELAKVQA
225 :     seq4 GANTQTPWVTLATAHPAKFPDAIKASAVGTTAQLPVHLADLFERSEHFDVLPNDIAAVQR
226 :     seq5 VQDG-TPAVCLATAHPAKFAEAVVR-AVGFEPSRPTSLEGIEALPSRCDVLDADRDAIKA
227 :     seq1 LNPG-EYGLFLGTAHPAKFKESVEA-ILGETLDLPKELAERADLPLLSHNLPADFAALRK
228 :     seq2 LQPG-ETGVFLETAHPAKFLQTVES-IIGTEVEIPAKLRAFMKGEKKSLPMTKEFADFKS
229 :     : * ****.** ::: :* * * : : .:
230 :    
231 :     seq3 FVSQHGNRGKPL
232 :     seq4 FMSGHLGA----
233 :     seq5 FIEKKAL-----
234 :     seq1 LMMNHQ------
235 :     seq2 YLLGK-------
236 :     : :
237 :     </pre>
238 :    
239 :     <h3> Given a multiple sequence alignment, determine the most likely evolutionary history of the sequences (i.e., construct a phylogenetic tree).</h3>
240 :    
241 : overbeek 1.3 From the five extant, similar sequences displayed in the previous alignment, we can construct
242 :     a tree that depicts the "phylogenetic history" of the sequences.
243 :     Here is one reasonable tree for those five sequences.
244 :    
245 : overbeek 1.1 <pre>
246 : overbeek 1.3                         ,--------------------------------------------------- seq1
247 :                                      |
248 :                                      |
249 :                   ,------------------|
250 :                   |                  |
251 :                   |                  |
252 :                   |                  `---------------------------------------------- seq2
253 :                   |
254 :                   |
255 :                   |
256 :              ,----|
257 :              |    |
258 :              |    |             ,-------------------------------- seq3
259 :              |    |             |
260 :              |    |             |
261 :              |    |-------------|
262 : overbeek 1.1 |                  |
263 :              |                  |
264 : overbeek 1.3 |                  `------------------------------ seq4
265 : overbeek 1.1 |
266 :              |
267 :              `---------------------------------------------- seq5
268 :     </pre>
269 :    
270 : overbeek 1.3 The tree suggests that at some point an ancestral
271 :     cell replicated. One copy led (through a chain of descendants) to <b>seq5</b>, while the remaining sequences descend
272 :     from the other copy.
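<p>
The branching order of the tree above can be written down compactly. The sketch below stores it as nested Python tuples (a hypothetical representation chosen for this example, with branch lengths omitted), lists the sequences below each child of the root, and prints the same topology in the widely used Newick notation.

```python
# Topology of the tree drawn above: seq5 splits off first, and the other
# four sequences form two pairs. Branch lengths are not represented.
TREE = ((("seq1", "seq2"), ("seq3", "seq4")), "seq5")


def leaves(node):
    """Return the sequence names found below a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]


def newick(node):
    """Render a node in Newick notation (without the trailing semicolon)."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(newick(child) for child in node) + ")"


for child in TREE:
    print(leaves(child))
print(newick(TREE) + ";")
```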
273 :     <p>
274 :     Note that we now have alignments that
275 :     contain thousands of sequences, and even displaying such trees is nontrivial.
276 :     Because evolution plays such a central role in the phenomena we study, the construction of alignments
277 :     and trees in order to compare extant versions of proteins and gain insight into their historical origins
278 :     is considered basic to the task at hand.
279 : overbeek 1.1
280 :     <h2>Some Random Facts that You Should Absorb</h2>
281 :    
282 :     Most genomes of bacteria contain between 400,000 and 12,000,000 characters.
283 :     Normally, the genes in a genome
284 :     cover about 90% of the genome.
285 :     Normally, there is about one gene per 1000 characters in a bacterial genome.
286 :     <p>
287 :     So,
288 :     <ul>
289 :     <li> What is the length of the average protein sequence?
290 :     <li>How many genes do these
291 :     genomes have?
292 :     <li>What is the average length of a gene?
293 :     </ul>
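One way to work through these questions is a back-of-the-envelope calculation from the three facts just stated. The sketch below is only that: it treats "about 90%" and "about one gene per 1000 characters" as exact, and it assumes each gene ends with one stop codon that contributes no amino acid. The constant and function names are made up for this example.

```python
CHARS_PER_GENE = 1000     # about one gene per 1000 characters
CODING_FRACTION = 0.90    # genes cover about 90% of the genome


def estimates(genome_length):
    """Return (gene count, average gene length, average protein length)."""
    gene_count = genome_length // CHARS_PER_GENE
    # Each 1000-character stretch holds one gene covering about 90% of it.
    avg_gene_length = round(CODING_FRACTION * CHARS_PER_GENE)
    # Three DNA characters per codon; the final stop codon adds no amino acid.
    avg_protein_length = avg_gene_length // 3 - 1
    return gene_count, avg_gene_length, avg_protein_length


for length in (400_000, 12_000_000):
    print(length, estimates(length))
```

Under these assumptions a typical gene is roughly 900 characters long, a typical protein roughly 300 amino acids long, and the genomes in the stated size range hold roughly 400 to 12,000 genes.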
294 :     <br>
295 : overbeek 1.3 It is worth spending a little time thinking about what types of
296 :     machines must exist in each cell. Here are a few thoughts to start with:
297 : overbeek 1.1 <ul>
298 :     <li>
299 :     There must be one or more machines that support replication of the cell. You would
300 :     need something to copy the genome, and you would need something that could build the DNA
301 :     bases that represent the characters (i.e., you will need machines to build the molecules
302 :     corresponding to each of the four characters in the alphabet of DNA bases.
303 :     <li>
304 :     As we mentioned, you have transport machines that take things into and out of the cell. Many
305 :     cells can import food in the form of sugar molecules. For example, many cells can import
306 :     <i>glucose</i>, a six-carbon compound. As the compound gets broken down into smaller compounds,
307 :     energy is salvaged from the broken bonds to power the machines in the cell. The smaller compounds
308 :     are used as building blocks for other needs.
309 :     <li>
310 :     There must be one or more machines involved in building proteins from the descriptions in the genes.
311 :     In particular, we will need a machine for each of the amino acids (unless the cell can import some
312 :     of them).
313 :     <li>
314 :     There must be mechanisms for sensing what is going on in the environment and allowing the cell
315 :     to react to it. For example, many cells can "swim" towards food.
316 :     </ul>
317 :     Those were just a few examples. For any cell, we have many, many machines, and we still
318 : overbeek 1.3 do not even understand what some of them do. Later, we will try to offer a more structured
319 :     estimate of what is already known.
320 : overbeek 1.1 <p>
321 :     About 50-60% of the genes occur within 5000 characters of another gene such that
322 : overbeek 1.3 the two genes encode proteins that are part of the same cellular machine. This fact
323 :     suggests that just having a large number of genomes would enable a person to group
324 :     the genes into the machines they implement, without the person understanding the functions
325 :     of the machines or the roles played by each protein.
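<p>
A hedged sketch of how such grouping evidence might be collected: for every pair of protein families, count the genomes in which genes from the two families lie within 5000 characters of each other. Pairs that are close in many genomes are candidates for belonging to the same machine. The input layout (a family label plus start and end positions per gene), the family labels, and the function name are all assumptions made for this illustration.

```python
from collections import Counter
from itertools import combinations


def close_family_pairs(genomes, max_gap=5000):
    """Count, per pair of families, the genomes in which the two are close.

    genomes maps a genome name to a list of (family, start, end) genes.
    """
    counts = Counter()
    for genes in genomes.values():
        close_here = set()
        for (fam1, start1, end1), (fam2, start2, end2) in combinations(genes, 2):
            # Distance between the two genes; negative if they overlap.
            gap = max(start1, start2) - min(end1, end2)
            if fam1 != fam2 and gap <= max_gap:
                close_here.add(frozenset((fam1, fam2)))
        counts.update(close_here)   # each genome votes at most once per pair
    return counts


# Hypothetical data: famA and famB are neighbors in two of three genomes.
example = {
    "genome1": [("famA", 100, 1000), ("famB", 1200, 2100), ("famC", 50000, 51000)],
    "genome2": [("famA", 7000, 7900), ("famB", 8000, 9000)],
    "genome3": [("famA", 100, 1000), ("famB", 30000, 31000)],
}
for pair, n in close_family_pairs(example).items():
    print(sorted(pair), n)
```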
326 : overbeek 1.1 <p>
327 :     Occasionally, proteins that are usually distinct in most cells are fused into a single protein in
328 :     a few cells. In these cases, the fused gene is (by definition) part of a single machine, and
329 :     in most cells in which the proteins are not fused, the two distinct proteins are separate components
330 : overbeek 1.3 of a single machine. This, too, offers clues to support analysis of which proteins go with which machines.
331 : overbeek 1.1 <p>
332 :     Biologists have figured out the roles of about 50% of the genes. That is, they can
333 :     place the gene in a cellular machine, they know what the machine does, and they know
334 :     the specific role of the gene in sustaining the functionality of the machine.
335 :     <br><br>
336 :    
337 :     <h2>Imposing a Structure on Characterizing the Inventory</h2>
338 :    
339 :     One central goal of bioinformatics is to support an accurate characterization of the cellular
340 :     machinery for each cell. It is of major importance to biologists that we be able to support
341 :     comparative analysis of cells. Perhaps the most important aspect of understanding cells relates to
342 :     their origin in an evolutionary process. Cells have a long evolutionary history dating back billions of
343 :     years. The machines we see in cells today arose in the past, so we expect to see many current cells
344 :     using machinery that resembles what turns up in other cells. When we compare machines from different
345 :     cells they often look remarkably similar. On the other hand, those that had a common origin in a cell that existed
346 :     billions of years in the past may now have versions that are not very similar. Modifications, optimizations,
347 :     and insignificant alterations all combine to explore the space of operational possibilities for
348 :     each type of machine. Hence, we need a framework for studying similarities and differences in the
349 :     cellular machines and the proteins that implement them.
350 :     <p>
351 :    
352 :     Here is a short formulation of one way to do this:
353 :     <br><br>
354 :     <ul>
355 :     <li>A <b>subsystem</b> (i.e., an abstract cellular machine) is a set of functional roles.
356 :     <li>Each protein implements one or more functional roles. The set of functional roles
357 :     implemented by the protein is called the <b>function of the protein</b>. The function of a multifunctional
358 :     protein that implements {functional-role-1,functional-role-2} is normally written as
359 :     <i>functional-role-1 / functional-role-2</i>.
360 :     <br><br>
361 :     <li>A <b>populated subsystem</b> is a subsystem with an attached spreadsheet. Each column
362 :     in the spreadsheet corresponds to a functional role in the subsystem, and each row corresponds to
363 :     a specific genome. Each cell in the spreadsheet contains the genes from the corresponding genome
364 :     that implement the designated functional role (there may be 0 or more such genes).
365 :     </ul>
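The three definitions above map naturally onto a small data structure. The following sketch is one possible encoding, not a description of any existing database: the class name, method names, and the example identifiers are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PopulatedSubsystem:
    """A subsystem (a set of functional roles) with its attached spreadsheet."""
    name: str
    roles: list[str]                      # one spreadsheet column per role
    # rows[genome][role] is the list of genes (possibly empty) in that cell.
    rows: dict[str, dict[str, list[str]]] = field(default_factory=dict)

    def add_gene(self, genome, role, gene):
        if role not in self.roles:
            raise ValueError(f"{role!r} is not a role of subsystem {self.name!r}")
        row = self.rows.setdefault(genome, {r: [] for r in self.roles})
        row[role].append(gene)

    def cell(self, genome, role):
        """Genes from the genome that implement the role (0 or more)."""
        return self.rows.get(genome, {}).get(role, [])


def protein_function(roles):
    """Write the function of a protein in the 'role-1 / role-2' convention."""
    return " / ".join(roles)


subsystem = PopulatedSubsystem("example-subsystem", ["functional-role-1", "functional-role-2"])
subsystem.add_gene("genomeA", "functional-role-1", "gene17")
print(subsystem.cell("genomeA", "functional-role-1"))
print(subsystem.cell("genomeA", "functional-role-2"))
print(protein_function(["functional-role-1", "functional-role-2"]))
```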
366 :     <br><br>
367 :     We do not actually know what machines are present in a cell. We are in the midst of a grand
368 :     effort to clarify which are there and what they do. The formulation of subsystems as abstract machines,
369 :     in which each row of a populated subsystem describes a specific cellular machine that is believed to be present,
370 :     represents a way to maintain a collection of estimates or assertions.
371 :     <p>
372 :     A <b>protein family</b> is defined to be a set of proteins that implement the same functional roles and
373 :     are similar over the entire lengths of the proteins.
374 :     <p>
375 :     We seek a situation in which each protein occurs in one or more subsystems and in a single protein family.
376 :     The computational tasks imposed by such a goal are obvious:
377 :     <ul>
378 :     <li>We need to construct databases that implement at least the following entities:
379 :     <ol>
380 :     <li>cells (i.e., each cell must have an ID and a set of attributes),
381 :     <li>genomes,
382 :     <li>genes,
383 :     <li>proteins,
384 :     <li>functional roles,
385 :     <li>subsystems, and
386 :     <li>protein families.
387 :     </ol>
388 :     <li> We need to add support for developing clues to function by integrating data
389 :     from sources like proximity within the genome, fusions, etc.
390 :     <li>We need to support a framework for the development of populated subsystems.
391 :     <li>We need to construct decision procedures for membership in protein families. Some
392 :     of these procedures will be quite complex, although the majority of cases can be
393 :     handled by fairly general procedures.
394 :     </ul>
395 :    
396 : overbeek 1.2 <h2>States of the Cell</h2>
397 : overbeek 1.1
398 : overbeek 1.2 The notion of <i>subsystem</i> was introduced as an <i>abstract machine</i> -- that is, as an
399 :     attempt to create a framework for understanding variations within specific celular machines via
400 :     a form of comparative analysis.
401 :    
402 :     In any specific cell, sets of specific cellular machines are
403 :     switched on and off as units. That is, they are <i>co-regulated</i>. We will call such a set
404 :     of <i>co-regulated cellular machines</i> a <b>regulon</b> (note that a regulon is often a set containing
405 :     a single cellular machine). A <b>state</b> of a cell will be defined
406 :     as the set of regulons that are operational at a point in time. Thus, a state amounts to the set
407 :     of cellular machines that are operational at one instant.
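<p>
These definitions translate directly into set operations. In the sketch below a regulon is a named set of cellular machines, a state is a set of regulon names, and the operational machines are the union of the machines in those regulons. All regulon and machine names are hypothetical placeholders.

```python
def operational_machines(state, regulons):
    """Return the machines that are operational in a state.

    state is a set of regulon names; regulons maps each name to its machines.
    """
    machines = set()
    for name in state:
        machines |= regulons[name]
    return machines


# Hypothetical cell with two regulons, one of which holds a single machine.
REGULONS = {
    "regulonA": {"machine1", "machine2"},
    "regulonB": {"machine3"},
}
print(sorted(operational_machines({"regulonA"}, REGULONS)))
print(sorted(operational_machines({"regulonA", "regulonB"}, REGULONS)))
```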
408 : overbeek 1.1 <p>
409 : overbeek 1.2 If we think of a car as a bag of machines that interact to make it function, we might consider there
410 :     to be a huge number of states. There are many very minor "machines" like the arm rest (or the radio, or the night light) that can be on or off. However, we can divide the states of a car into major groupings based on the status
411 :     of some key "machines". For example, "off" (the state in which the engine is turned off and the car is parked) and
412 :     "on" (the engine is running and the car is moving) might be viewed as a crude partitioning of the states into
413 :     two "major states".
414 : overbeek 1.1 <p>
415 : overbeek 1.2 Similarly, I believe that we should think about <i>major states of the cell</i> as being determined by the
416 :     functioning (or not) of a limited set of regulons. The determination of these regulons, the major states,
417 :     and how transitions between them are managed are all now parts of the picture being filled in.
418 :    
419 :    
420 :     <h2>Microarrays</h2>
421 :    
422 :     Microarrays are, for a given genome, two lists of genes that "changed expression levels" between two states of a
423 :     cell. Basically, the first list contains genes that were "active" during the first state, but not the second; and the
424 :     second list contains genes that were "active" in the second but not the first. If a cellular
425 :     machine utilizes protein <i>X</i>, and <i>X</i> is in the first list, and if <i>X</i> is used in
426 :     only one cellular machine, then it would be reasonable to infer that the machine was
427 :     active in the first state, but not the second. If one knew the regulons for a specific cell, it would go
428 :     a long way to support extraction of insights from these microarrays. On the other hand, if one had many,
429 :     many microarrays, and if the specific cellular machines for the cell are known, then one could make
430 :     substantial progress in uncovering the exact composition of the regulons that make up the cell.
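<p>
The inference rule described above can be sketched in a few lines: a machine is reported as active in the first state if at least one of its proteins appears in the first list and that protein is used by no other machine. The function name and the toy data are assumptions for illustration; real expression data is noisier than two clean lists.

```python
from collections import Counter


def machines_active_in_first_state(first_list, machine_proteins):
    """Infer the machines that were active in the first state only.

    first_list holds the proteins "active" in the first state but not the second;
    machine_proteins maps each machine to the set of proteins it uses.
    """
    # Number of machines that use each protein.
    usage = Counter(p for proteins in machine_proteins.values() for p in set(proteins))
    active = set()
    for machine, proteins in machine_proteins.items():
        if any(p in first_list and usage[p] == 1 for p in proteins):
            active.add(machine)
    return active


# Hypothetical machines: X is unique to machine1, Y is shared by both.
MACHINES = {"machine1": {"X", "Y"}, "machine2": {"Y", "Z"}}
print(machines_active_in_first_state({"X"}, MACHINES))
print(machines_active_in_first_state({"Y"}, MACHINES))
```

The shared protein Y supports no conclusion on its own, because the rule cannot tell which of the two machines accounts for its change in expression.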
431 : overbeek 1.1
