Diff of /FigTutorial/tut_abs.html

revision 1.1, Mon Oct 1 20:12:41 2007 UTC revision 1.3, Sat Oct 13 15:06:20 2007 UTC
# Line 9  Line 9 
9    
10  A <b>cell</b> is a bag (i.e., a volume enclosed by a membrane) that contains three types of things: compounds, cellular machines, and a genome.  A <b>cell</b> is a bag (i.e., a volume enclosed by a membrane) that contains three types of things: compounds, cellular machines, and a genome.
11  <p>  <p>
12  By the term <b>compound</b> I refer to the normal notion of chemical compound.  By the term <b>compound</b> we refer to the normal notion of chemical compound.
13  <p>  <p>
14    
15  A <b>cellular machine</b> is a set of proteins that together perform a function.   This function is often t  A <b>cellular machine</b> is a set of proteins that together perform a function. Unless otherwise noted,
16  transform a set of compounds into another set.  Some types of machines (transport machines)  when we use the term <i>machine</i> we will always be speaking of a cellular machine.
17    Many machines
18    transform one set of compounds into another set.  Some machines (transport machines)
19  are used to move compounds into  are used to move compounds into
20  or out of the cell.  or out of the cell.  Later we will try to convey a more comprehensive notion of what functions are implemented
21    by machines that we understand.
22  <p>  <p>
23    
24  A <b>protein</b> is a string of amino acids (i.e., a string in the  20-character alphabet {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}).  A <b>protein</b> is a string of amino acids (i.e., a string in the  20-character alphabet {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}).
# Line 56  Line 59 
59  </table>  </table>
60  <br><br>  <br><br>
61  <hr>  <hr>
62  This minimal notion of a cell is enough to explain some of the central  We will be speaking about organisms that are a single cell.  At some point life began on earth.
63    The single-celled organisms that we know of replicate producing copies of themselves that have
64    genomes which usually have very, very similar content to that of the parent cell.  <b>Evolution</b> is the
65    process in which cells replicate with some alterations in their genomes, are subjected to
66    <i>selective pressure</i>, and survive or not depending on many somewhat random factors.  The makeup of
67    cells (i.e., the genomes they contain and the machines that define what they are capable of doing)
68    changes gradually (and sometimes not so gradually) as time passes.
69    <p>
70    The original life forms that existed billions of years ago have evolved into three broad categories of
71    life forms.  That is, the evolutionary process led to early divisions, and these led to three main
72    categories of single-celled organisms.  We call these three forms the <b>archaea</b>,
73    the <b>bacteria</b>, and the <b>eukaryotes</b>.
74    A majority of the organisms for which we have acquired complete genomes are from the bacteria,
75    although the
76    numbers are rapidly growing for all three domains.
77    <p>
78    This minimal notion of a cell is enough to explain some of the basic
79  problems in bioinformatics:  problems in bioinformatics:
80    
81  <h3>Identify the genes within a genome</h3>  <h3>Identify the genes within a genome</h3>
82    
83  This problem simply involves taking a genome (a string of DNA) and locating  If we are to understand the contents of genomes, we will need to
84  the set of genes it contains.  Does the existence of 100s of genomes (genomes  locate the genes that occur in each genome.  This problem simply involves taking a genome (a
85  with at least some estimate of where the genes occur) effect how you might do this?  string of DNA) and locating the set of genes it contains.
86    In the case of bacteria and archaea, we know pretty well how to
87    locate the genes.
88    Once we
89    have identified instances from many genomes, it becomes possible to
90    recognize the genes in a new genome by just looking for things similar
91    to those we already understand.  The following problem is at the heart of recognizing when two
92    genes are "similar".
93    
94  <h3>Given two proteins. "align" them in a way that minimizes some edit function.  </h3>  <h3>Given two genes. "align" them in a way that minimizes some edit function.  </h3>
95    
96    For example, here is what you see when you align two genes from distinct organisms:
97    
 For example:  
 <br>  
 <br>  
98  <pre>  <pre>
99    
 seq1            -------------------MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLT  
 seq2            MKIRVICGAPTPKPFIKIPMKYYSTNKQAPLASLEEAVVKGLASDKGLFMPMTIKPLPQE  
                                    ** *. :.:   .*: :**.:**..::***:*  :  :.  
100    
101  seq1            EIDEMLKLDFVTRSAKILSAFIGDEIPQEILEERVRAAFAFPAPVANVESDVGCLELFHG  gene1           ATGGCTGATTTATTCGCATTGACCGAAGAAGCGTTGGCGGGCATGGGCATCGAGTTGGTC
102  seq2            FYDEIENLSFREIAYRVADAFFGEDVPAETLKEIVYDTLNFDVPLVPVKENIYSLELFHG  gene2           ---GTGCAACTGACGGAACTGATAGAAACTACGGTCACGGGGCTCGGCTACGAGCTCGTC
103                    **: :*.*   : :: .**:*:::* * *:* *  :: * .*:. *:.:: .******                     *   *  *    * * ***  ***    ** *  ****  * ***  **** * ***
104    
105  seq1            PTLAFKDFGGRFMAQMLTHIA---GDKPVTILTATSGDTGAAVAHAFYGLPNVKVVILYP  gene1           GATGTCGAACGTGCCGCCTTAGGCTTGTTGCGCGTGACCATAGACCGTGAGGACGGTGTT
106  seq2            PTLAFKDVGGRFMARLLGYFIRKEGRKQVNVLVATSGDTGSAVANGFLGVEGIHVYVLYP  gene2           GATCTCGAGCGCACCGGGCGCGGCATGGTCTGCGTCTACATCGATCAGCCCGCCGGCATC
107                  *******.******::* ::    * * *.:*.*******:***:.* *: .::* :***                  *** **** **  ***     *** ** *  ****   *** ** *     * ***  *
108    
109  seq1            RGKISPLQEKLFCTLGGNIETVAIDGDFDACQALVKQAFDDEELKVALGLNSANSINISR  gene1           CGCATCGAAGATTGTGAGCAGGTGTCCCGGCAATTGTCGCGCGTCTACGAGGTCGAGAAC
110  seq2            KGKVSEIQEKQFTTLGRNITALEVDGTFDDCQALVKAAFMDQELNEQLLLTSANSINVAR  gene2           ACGATCGACGATTGCGAGAAGGTCACGCGTCAGCTCCAGCACGTACTGACGGTCGAAAAC
111                  :**:* :*** * *** ** :: :** ** ****** ** *:**:  * *.******::*                     ***** ***** *** ****  * ** **  *   ** ***      ****** ***
112    
113  seq1            LLAQICYYFEAVAQLPQETRNQ-LVVSVPSGNFGDLTAGLLAKSLGLPVKRFIAATNVND  gene1           ATCGATTACAAACGTCTGGAAGTTGGCTCGCCGGGCGTGGATCGCCCCTTGCGCAACGAG
114  seq2            FLPQAFYYFYAYAQLKKAGRAENVVICVPSGNFGNITAGLFGKKMGLPVRRFIAANNKND  gene2           ATCGATTACGAACGGCTCGAGGTCTCGTCACCGGGGCTCGACCGGCCGTTGAAGAAGCTG
115                  :*.*  *** * *** :  * : :*:.*******::****:.*.:****:*****.* **                  ********* **** ** ** **    ** *****  * ** ** ** ***   **   *
116    
117    gene1           GCGGAATTCCGTCGTTTCGCGGGTGAACGTATCGAGATCAAGCTGCGTGAGGCAGTCGAC
118    gene2           GCTGACTTCACGCGTTTCGCGGGCAGCGAGGCCGTCATCACCCTGAAAAAGCCGTTGGAC
119                    ** ** ***   ***********         **  ****  ***    ** *  * ***
120    
121    gene1           GGGCGCAAAGTGTTTACCGGCATCCTGCAAGAGGCGGACACGTCTGCTGACGATAAGACG
122    gene2           GGGCGCAAGACGTACCGGGGCATTCTGCACGCGCCGAAC------------GGCGAGACG
123                    ********   **     ***** ***** * * ** **            *   *****
124    
125    gene1           GTGTTCGGTCTCGAATTTGAGGCAAAGAAGGACGATATTCAGGTACTGAGCTTCACGCTC
126    gene2           AT---CGGTTTGGAATTTGAGAGGAAGAAGGGCGAGGCGGCCATGCTGGATTTCACGCTG
127                     *   **** * *********   ******* ***        * ***   ********
128    
129    gene1           GATGACATCGAGCGCGCCAAGCTGGATCCCGTTCTGGATTTCAAGGGCAAAAAGCGATGA
130    gene2           GCGGACATCGACAAGGCCCGCCTGATTCCGCACGTTGACTTTAGGAGCCGCAAACAATGA
131                    *  ********    ***   ***  ***     * ** ** * * **   ** * ****
132  </pre>  </pre>
133    <hr>
134    
135    The sequences are recognizably similar, and in fact implement exactly the same function
136    in the two cells.  If we align the protein sequences corresponding to these two
137    genes, we get
138    
139    <pre>
140    gene1           MADLFALTEEALAGMGIELVDVERAALGLLRVTIDREDGVRIEDCEQVSRQLSRVYEVEN
141    gene2           -VQLTELIETTVTGLGYELVDLERTGRGMVCVYIDQPAGITIDDCEKVTRQLQHVLTVEN
142                      :*  * * :::*:* ****:**:. *:: * **:  *: *:***:*:***.:*  ***
143    
144    gene1           IDYKRLEVGSPGVDRPLRNEAEFRRFAGERIEIKLREAVDGRKVFTGILQEADTSADDKT
145    gene2           IDYERLEVSSPGLDRPLKKLADFTRFAGSEAVITLKKPLDGRKTYRGILHAPNG-----E
146                    ***:****.***:****:: *:* ****..  *.*::.:****.: ***: .:
147    
148    gene1           VFGLEFEAKKDDIQVLSFTLDDIERAKLDPVLDFKGKKR
149    gene2           TIGLEFERKKGEAAMLDFTLADIDKARLIPHVDFRSRKQ
150                    .:***** **.:  :*.*** **::*:* * :**:.:*:
151    </pre>
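The step from a gene to its protein sequence can be sketched in a few lines. This is a minimal illustration, assuming the standard genetic code and a gene given as an ungapped DNA string read from its first character; the names `CODON_TABLE` and `translate` are hypothetical and do not come from the tutorial. Alignment gaps (`-`) would have to be removed before calling it.

```python
# Standard genetic code, built from the conventional TCAG ordering.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate an ungapped DNA string, three characters at a time.

    Translation stops at the first stop codon. Characters other than
    A, C, G, T raise a KeyError (an assumption of this sketch).
    """
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# First 33 characters of gene1 from the alignment above.
print(translate("ATGGCTGATTTATTCGCATTGACCGAAGAAGCG"))  # MADLFALTEEA
```

The printed string matches the start of the gene1 protein shown in the alignment above.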
152    
153    There is a great deal of work relating to recognizing when two sequences are
154    similar and whether or not they had a common ancestor.  Understanding why
155    selective pressure conserves sections of sequences, but not others, will yield
156    important clues.  Can you reason out why some sections might be conserved, while
157    others vary wildly?
158    <p>
159    
160    Comparing sets of sequences that have retained the same function is
161    at the heart of understanding cellular machines and the proteins that implement them.
162    We find that looking at sets (often with more than two sequences) and aligning them
163    is important.
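One simple choice of "edit function" is the number of mismatches plus the number of gap characters. The sketch below aligns two strings so as to minimize that count, using the classic dynamic-programming approach to global alignment. It is an illustration under assumed unit costs, not the scoring scheme used to produce the alignments shown here; the function name `align` and its parameters are hypothetical.

```python
def align(a, b, gap=1, mismatch=1):
    """Globally align strings a and b, minimizing mismatches plus gaps.

    Returns (cost, aligned_a, aligned_b), where "-" marks a gap.
    """
    n, m = len(a), len(b)
    # cost[i][j] is the cheapest alignment of a[:i] with b[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = cost[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            cost[i][j] = min(substitute, cost[i - 1][j] + gap, cost[i][j - 1] + gap)

    # Walk back from the corner to recover one optimal alignment.
    i, j = n, m
    row_a, row_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (
            0 if a[i - 1] == b[j - 1] else mismatch
        ):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return cost[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


total, top, bottom = align("MADLFALTEE", "VQLTELIE")
print(total)
print(top)
print(bottom)
```

Real alignment tools use scoring tables that make some amino acid substitutions cheaper than others and treat long gaps differently from short ones, but the minimization idea is the same.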
164    
 shows an alignment of two proteins (called <i>seq1</i> and <i>seq2</i>).  
165    
166  <h3> Given a set of sequences, align them in a way that minimizes some edit function.</h3>  <h3> Given a set of sequences, align them in a way that minimizes some edit function.</h3>
167    
# Line 170  Line 238 
238    
239  <h3> Given a multiple sequence alignment, determine the most likely evolutionary history of the sequences (i.e., construct a phylogenetic tree).</h3>  <h3> Given a multiple sequence alignment, determine the most likely evolutionary history of the sequences (i.e., construct a phylogenetic tree).</h3>
240    
241  Here is one reasonable tree for the last 5 sequences.  Note that we now have alignments that  From the extant five sequences that are similar and displayed in the previous alignment, we can construct
242  contain thousands of sequences, and even displaying such trees is nontrivial.  a tree that depicts the "phylogenetic history" of the sequences.
243    Here is one reasonable tree for the last 5 sequences.
244    
245  <pre>  <pre>
246                       ,--------------------------------------------------- seq1                       ,--------------------------------------------------- seq1
247                       |                       |
# Line 183  Line 253 
253    |    |
254    |    |
255    |    |
256    |    ,----|
   |  
   |             ,-------------------------------- seq3  
   |             |  
257    |             |    |             |
258    |-------------|    |    |             ,-------------------------------- seq3
259      |    |             |
260      |    |             |
261      |    |-------------|
262    |             |    |             |
263    |             |    |             |
264    |             `------------------------------ seq4    |             `------------------------------ seq4
# Line 197  Line 267 
267    `---------------------------------------------- seq5    `---------------------------------------------- seq5
268  </pre>  </pre>
269    
270  This is an <i>unrooted tree</i>, since we have no idea just looking at extant  The tree suggests that at some point an ancestral
271  sequences about where the root should lie.  cell replicated.  One copy led (through a chain of descendants) to <b>seq5</b>, while the remaining sequences descend
272    from the other copy.
273    <p>
274    Note that we now have alignments that
275    contain thousands of sequences, and even displaying such trees is nontrivial.
276    Because evolution plays such a central role in the phenomena we study, the construction of alignments
277    and trees in order to compare extant versions of proteins and gain insight into their historical origins
278    is considered basic to the task at hand.
279    
280  <h2>Some Random Facts that You Should Absorb</h2>  <h2>Some Random Facts that You Should Absorb</h2>
281    
# Line 215  Line 292 
292  <li>What is the average length of a gene?  <li>What is the average length of a gene?
293  </ul>  </ul>
294  <br>  <br>
295  It is worth spending just a short bit of time thinking about what types of cellular  It is worth spending just a short bit of time thinking about what types of
296  machines must exist.  Here are a few thoughts to start with  machines must exist in each cell.  Here are a few thoughts to start with
297  <ul>  <ul>
298  <li>  <li>
299  There must be one or more machines that support replication of the cell.  You would  There must be one or more machines that support replication of the cell.  You would
# Line 238  Line 315 
315  to react to it.  For example, many cells can "swim" towards food.  to react to it.  For example, many cells can "swim" towards food.
316  </ul>  </ul>
317  Those were just a few examples.  For any cell, we have many, many machines, and we still  Those were just a few examples.  For any cell, we have many, many machines, and we still
318  do not even understand what some of them do.  do not even understand what some of them do.  Later, we will try to offer a more structured
319    estimate of what is already known.
320  <p>  <p>
321  About 50-60% of the genes occur within 5000 characters of another gene such that  About 50-60% of the genes occur within 5000 characters of another gene such that
322  the two genes encode proteins that are part f the same cellular machine.  If you  the two genes encode proteins that are part of the same cellular machine.  This fact
323  had a genome in which the genes were identified, but the correspondence between the encoded  suggests that just having a large number of genomes would enable a person to group
324  proteins and cellular machines was completely unknown, what could you learn using this fact?  the genes into the machines they implement, without the person understanding the functions
325  Is the situation significantly different if you have 1000 genomes (let us say that  of the machines or the roles played by each protein.
 you know where the genes occur, but the correspondence between the proteins and cellular machines  
 is completely unknown in each case).  
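The idea of grouping genes by proximity across many genomes can be sketched as follows. This is a toy illustration under several assumptions: each genome is given as a list of `(family, start)` pairs, where `family` is a hypothetical label already assigned to similar genes in different genomes, and the distance between two genes is measured from start to start. None of these names or data structures come from the tutorial.

```python
from collections import Counter


def close_pairs(genome, max_gap=5000):
    """Return the unordered pairs of gene families that lie within max_gap."""
    genes = sorted(genome, key=lambda gene: gene[1])
    pairs = set()
    for i, (family_a, start_a) in enumerate(genes):
        for family_b, start_b in genes[i + 1:]:
            if start_b - start_a > max_gap:
                break  # genes are sorted, so later ones are farther still
            if family_a != family_b:
                pairs.add(frozenset((family_a, family_b)))
    return pairs


def conserved_neighbors(genomes, min_genomes=2):
    """Count, per family pair, the genomes in which the pair is close."""
    counts = Counter()
    for genome in genomes:
        counts.update(close_pairs(genome))
    return {pair: n for pair, n in counts.items() if n >= min_genomes}


# Three made-up genomes; famA and famB are neighbors in all of them.
genomes = [
    [("famA", 100), ("famB", 1500), ("famC", 90000)],
    [("famB", 40000), ("famA", 42000), ("famD", 43000)],
    [("famA", 10), ("famC", 20000), ("famB", 3000)],
]
print(conserved_neighbors(genomes))
```

A pair that is close in one genome may be a coincidence; a pair that stays close in many genomes is a much better candidate for membership in the same machine, which is why the count of supporting genomes matters.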
326  <p>  <p>
327  Occasionally, proteins that are usually distinct in most cells are fused into a single protein in  Occasionally, proteins that are usually distinct in most cells are fused into a single protein in
328  a few cells.  In these cases, the fused gene is (by definition) part of a single machine, and  a few cells.  In these cases, the fused gene is (by definition) part of a single machine, and
329  in most cells in which the proteins are not fused, the two distinct proteins are separate components  in most cells in which the proteins are not fused, the two distinct proteins are separate components
330  of a single machine.  How wuld you go about locating fused genes, and what could you learn from them?  of a single machine.  This, too, offers clues to support analysis of which proteins go with which machines.
331  <p>  <p>
332  Biologists have figured out the roles of about 50% of the genes.  That is, they can  Biologists have figured out the roles of about 50% of the genes.  That is, they can
333  place the gene in a cellular machine, they know what the machine does, and they know  place the gene in a cellular machine, they know what the machine does, and they know
# Line 317  Line 393 
393  handled by fairly general procedures.  handled by fairly general procedures.
394  </ul>  </ul>
395    
396  <h2>Microarrays and States of the Cell</h2>  <h2>States of the Cell</h2>
397    
398  We wil think of a <b>regulon</b> as a set of subsystems.  A <b>state of the cell</b> is  The notion of <i>subsystem</i> was introduced as an <i>abstract machine</i> -- that is, as an
399  defined as the set of regulons that are operational at a point in time.  attempt to create a framework for understanding variations within specific cellular machines via
400  <p>  a form of comparative analysis.
401  A <b>consistent microarray</b> (for our purposes) is  
402  <ol>  In any specific cell, sets of specific cellular machines are
403  <li> The ID of an experiment.  The experiment corresponds to two states of the cell, S1 and S2.  switched on and off as units.  That is, they are <i>co-regulated</i>.  We will call such a set
404  <li> A list of proteins that are in in the regulons in S1, but not in those of S2.  of <i>co-regulated cellular machines</i> a <b>regulon</b> (note that a regulon is often a set containing
405  <li> A list of proteins that are in the regulons of S2, but not in those of S1.  a single cellular machine).  A <b>state</b> of a cell will be defined
406  </ol>  as the set of regulons that are operational at a point in time.  Thus, a state amounts to the set
407  <br>  of cellular machines that are operational at one instant.
408  A <b>real microarray</b> is just two sets of proteins.  We have some notion (e.g., an ID) for  <p>
409  each of two states of the cell, but no idea what regulons make up these states.  There is a  If we think of a car as a bag of machines that interact to make it function, we might consider there
410  substantial error rate in the two lists of proteins (e.g., some of the proteins in the first list either were not in  to be a huge number of states.  There are many very minor "machines" like the arm rest (or the radio, or the night light) that can be on or off.  However, we can divide the states of a car into major groupings based on the status
411  S1 or they were in S2).  of some key "machines".  For example, "off" (the state in which the engine is turned off and the car is parked) and
412  <p>  "on" (the engine is running and the car is moving) might be viewed as a crude partitioning of the states into
413  The interesting research question is, given a large list of real microarrays, can you  two "major states".
414  attach sets of regulons to all of the state IDs, and then give a minimal set of changes to the data  <p>
415  in the real microarrays needed to convert them to consistent microarrays.  Similarly, I believe that we should think about <i>major states of the cell</i> as being determined by the
416    functioning (or not) of a limited set of regulons.  The determination of these regulons, the major states,
417    and how transitions between them are managed are all now parts of the picture being filled in.
418    
419    
420    <h2>Microarrays</h2>
421    
422    Microarrays are, for a given genome, two lists of genes that "changed expression levels" between two states of a
423    cell.  Basically, the first list contains genes that were "active" during the first state, but not the second; and the
424    second list contains genes that were "active" in the second but not the first.  If a cellular
425    machine utilizes protein <i>X</i>, and <i>X</i> is in the first list, and if <i>X</i> is used in
426    only one cellular machine, then it would be reasonable to infer that the machine was
427    active in the first state, but not the second.  If one knew the regulons for a specific cell, it would go
428    a long way toward supporting the extraction of insights from these microarrays.  On the other hand, if one had many,
429    many microarrays, and if the specific cellular machines for the cell are known, then one could make
430    substantial progress in uncovering the exact composition of the regulons that make up the cell.
431    
