
Diff of /FigTutorial/tut_abs.html


revision 1.2, Tue Oct 9 21:19:55 2007 UTC revision 1.3, Sat Oct 13 15:06:20 2007 UTC
# Line 9  Line 9 
9    
10  A <b>cell</b> is a bag (i.e., a volume enclosed by a membrane) that contains three types of things: compounds, cellular machines, and a genome.  A <b>cell</b> is a bag (i.e., a volume enclosed by a membrane) that contains three types of things: compounds, cellular machines, and a genome.
11  <p>  <p>
12  By the term <b>compound</b> I refer to the normal notion of chemical compound.  By the term <b>compound</b> we refer to the normal notion of chemical compound.
13  <p>  <p>
14    
15  A <b>cellular machine</b> is a set of proteins that together perform a function.   This function is often t  A <b>cellular machine</b> is a set of proteins that together perform a function. Unless otherwise noted,
16  transform a set of compounds into another set.  Some types of machines (transport machines)  when we use the term <i>machine</i> we will always be speaking of a cellular machine.
17    Many machines
18    transform one set of compounds into another set.  Some machines (transport machines)
19  are used to move compounds into  are used to move compounds into
20  or out of the cell.  or out of the cell.  Later we will try to convey a more comprehensive notion of what functions are implemented
21    by machines that we understand.
22  <p>  <p>
23    
24  A <b>protein</b> is a string of amino acids (i.e., a string in the  20-character alphabet {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}).  A <b>protein</b> is a string of amino acids (i.e., a string in the  20-character alphabet {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}).
# Line 56  Line 59 
59  </table>  </table>
60  <br><br>  <br><br>
61  <hr>  <hr>
62  This minimal notion of a cell is enough to explain some of the central  We will be speaking about organisms that are a single cell.  At some point life began on earth.
63    The single-celled organisms that we know of replicate, producing copies of themselves that have
64    genomes which usually have very, very similar content to that of the parent cell.  <b>Evolution</b> is the
65    process in which cells replicate with some alterations in their genomes, are subjected to
66    <i>selective pressure</i>, and survive or not depending on many somewhat random factors.  The makeup of
67    cells (i.e., the genomes they contain and the machines that define what they are capable of doing)
68    changes gradually (and sometimes not so gradually) as time passes.
69    <p>
70    The original life forms that existed billions of years ago have evolved into three broad categories of
71    life forms.  That is, the evolutionary process led to early divisions, and these led to three main
72    categories of single-celled organisms.  We call these three forms the <b>archaea</b>,
73    the <b>bacteria</b>, and the <b>eukaryotes</b>.
74    A majority of the organisms for which we have acquired complete genomes are from the bacteria,
75    although the
76    numbers are rapidly growing for all three domains.
77    <p>
78    This minimal notion of a cell is enough to explain some of the basic
79  problems in bioinformatics:  problems in bioinformatics:
80    
81  <h3>Identify the genes within a genome</h3>  <h3>Identify the genes within a genome</h3>
82    
83  This problem simply involves taking a genome (a string of DNA) and locating  If we are to understand the contents of genomes, we will need to
84  the set of genes it contains.  Does the existence of 100s of genomes (genomes  locate the genes that occur in each genome.  This problem simply involves taking a genome (a
85  with at least some estimate of where the genes occur) effect how you might do this?  string of DNA) and locating the set of genes it contains.
86    In the case of bacteria and archaea, we know pretty well how to
87    locate the genes.
88    Once we
89    have identified instances from many genomes, it becomes possible to
90    recognize the genes in a new genome by just looking for things similar
91    to those we already understand.  The following problem is at the heart of recognizing when two
92    genes are "similar".
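
The simplest starting point for locating genes can be sketched in Python. This is a naive illustration only, not the method the tutorial relies on: it scans the forward strand in three reading frames for an ATG start followed in-frame by a stop codon. The function name `find_orfs` and the `min_len` cutoff are assumptions made for this example. Real gene callers also handle the reverse strand, alternative start codons such as GTG, and similarity to genes already identified in other genomes.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(genome, min_len=90):
    """Return (start, end) pairs for open reading frames on the forward strand.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    """
    genome = genome.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for i in range(frame, len(genome) - 2, 3):
            codon = genome[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if end - start >= min_len:
                    orfs.append((start, end))
                start = None
    return sorted(orfs)


# Toy genome with one short reading frame: ATG AAA TTT TAG
print(find_orfs("CCATGAAATTTTAGCC", min_len=9))
```

Lowering `min_len` finds more candidates but also more spurious ones, which is one reason similarity to known genes is so useful.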
93    
94  <h3>Given two proteins. "align" them in a way that minimizes some edit function.  </h3>  <h3>Given two genes, "align" them in a way that minimizes some edit function.  </h3>
95    
96    For example, here is what you see when you align two genes from distinct organisms:
97    
 For example:  
 <br>  
 <br>  
98  <pre>  <pre>
99    
 seq1            -------------------MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLT  
 seq2            MKIRVICGAPTPKPFIKIPMKYYSTNKQAPLASLEEAVVKGLASDKGLFMPMTIKPLPQE  
                                    ** *. :.:   .*: :**.:**..::***:*  :  :.  
100    
101  seq1            EIDEMLKLDFVTRSAKILSAFIGDEIPQEILEERVRAAFAFPAPVANVESDVGCLELFHG  gene1           ATGGCTGATTTATTCGCATTGACCGAAGAAGCGTTGGCGGGCATGGGCATCGAGTTGGTC
102  seq2            FYDEIENLSFREIAYRVADAFFGEDVPAETLKEIVYDTLNFDVPLVPVKENIYSLELFHG  gene2           ---GTGCAACTGACGGAACTGATAGAAACTACGGTCACGGGGCTCGGCTACGAGCTCGTC
103                    **: :*.*   : :: .**:*:::* * *:* *  :: * .*:. *:.:: .******                     *   *  *    * * ***  ***    ** *  ****  * ***  **** * ***
104    
105  seq1            PTLAFKDFGGRFMAQMLTHIA---GDKPVTILTATSGDTGAAVAHAFYGLPNVKVVILYP  gene1           GATGTCGAACGTGCCGCCTTAGGCTTGTTGCGCGTGACCATAGACCGTGAGGACGGTGTT
106  seq2            PTLAFKDVGGRFMARLLGYFIRKEGRKQVNVLVATSGDTGSAVANGFLGVEGIHVYVLYP  gene2           GATCTCGAGCGCACCGGGCGCGGCATGGTCTGCGTCTACATCGATCAGCCCGCCGGCATC
107                  *******.******::* ::    * * *.:*.*******:***:.* *: .::* :***                  *** **** **  ***     *** ** *  ****   *** ** *     * ***  *
108    
109  seq1            RGKISPLQEKLFCTLGGNIETVAIDGDFDACQALVKQAFDDEELKVALGLNSANSINISR  gene1           CGCATCGAAGATTGTGAGCAGGTGTCCCGGCAATTGTCGCGCGTCTACGAGGTCGAGAAC
110  seq2            KGKVSEIQEKQFTTLGRNITALEVDGTFDDCQALVKAAFMDQELNEQLLLTSANSINVAR  gene2           ACGATCGACGATTGCGAGAAGGTCACGCGTCAGCTCCAGCACGTACTGACGGTCGAAAAC
111                  :**:* :*** * *** ** :: :** ** ****** ** *:**:  * *.******::*                     ***** ***** *** ****  * ** **  *   ** ***      ****** ***
112    
113  seq1            LLAQICYYFEAVAQLPQETRNQ-LVVSVPSGNFGDLTAGLLAKSLGLPVKRFIAATNVND  gene1           ATCGATTACAAACGTCTGGAAGTTGGCTCGCCGGGCGTGGATCGCCCCTTGCGCAACGAG
114  seq2            FLPQAFYYFYAYAQLKKAGRAENVVICVPSGNFGNITAGLFGKKMGLPVRRFIAANNKND  gene2           ATCGATTACGAACGGCTCGAGGTCTCGTCACCGGGGCTCGACCGGCCGTTGAAGAAGCTG
115                  :*.*  *** * *** :  * : :*:.*******::****:.*.:****:*****.* **                  ********* **** ** ** **    ** *****  * ** ** ** ***   **   *
116    
117    gene1           GCGGAATTCCGTCGTTTCGCGGGTGAACGTATCGAGATCAAGCTGCGTGAGGCAGTCGAC
118    gene2           GCTGACTTCACGCGTTTCGCGGGCAGCGAGGCCGTCATCACCCTGAAAAAGCCGTTGGAC
119                    ** ** ***   ***********         **  ****  ***    ** *  * ***
120    
121    gene1           GGGCGCAAAGTGTTTACCGGCATCCTGCAAGAGGCGGACACGTCTGCTGACGATAAGACG
122    gene2           GGGCGCAAGACGTACCGGGGCATTCTGCACGCGCCGAAC------------GGCGAGACG
123                    ********   **     ***** ***** * * ** **            *   *****
124    
125    gene1           GTGTTCGGTCTCGAATTTGAGGCAAAGAAGGACGATATTCAGGTACTGAGCTTCACGCTC
126    gene2           AT---CGGTTTGGAATTTGAGAGGAAGAAGGGCGAGGCGGCCATGCTGGATTTCACGCTG
127                     *   **** * *********   ******* ***        * ***   ********
128    
129    gene1           GATGACATCGAGCGCGCCAAGCTGGATCCCGTTCTGGATTTCAAGGGCAAAAAGCGATGA
130    gene2           GCGGACATCGACAAGGCCCGCCTGATTCCGCACGTTGACTTTAGGAGCCGCAAACAATGA
131                    *  ********    ***   ***  ***     * ** ** * * **   ** * ****
132  </pre>  </pre>
133    <hr>
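
One concrete choice of "edit function" is the unit-cost edit distance, in which each substitution, insertion, or deletion costs 1. The Python sketch below computes that minimum cost by dynamic programming. The name `edit_distance` and the unit costs are assumptions for illustration; practical alignment tools use substitution matrices and gap penalties, and they also report the alignment itself rather than only its cost.

```python
def edit_distance(a, b):
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if char_a == char_b else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))
```

Keeping only two rows of the table gives the cost in memory proportional to the shorter dimension; recovering the alignment requires keeping the full table or a traceback.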
134    
135    The sequences are recognizably similar, and in fact implement exactly the same function
136    in the two cells.  If we align the protein sequences corresponding to these two
137    genes, we get
138    
139    <pre>
140    gene1           MADLFALTEEALAGMGIELVDVERAALGLLRVTIDREDGVRIEDCEQVSRQLSRVYEVEN
141    gene2           -VQLTELIETTVTGLGYELVDLERTGRGMVCVYIDQPAGITIDDCEKVTRQLQHVLTVEN
142                      :*  * * :::*:* ****:**:. *:: * **:  *: *:***:*:***.:*  ***
143    
144    gene1           IDYKRLEVGSPGVDRPLRNEAEFRRFAGERIEIKLREAVDGRKVFTGILQEADTSADDKT
145    gene2           IDYERLEVSSPGLDRPLKKLADFTRFAGSEAVITLKKPLDGRKTYRGILHAPNG-----E
146                    ***:****.***:****:: *:* ****..  *.*::.:****.: ***: .:
147    
148    gene1           VFGLEFEAKKDDIQVLSFTLDDIERAKLDPVLDFKGKKR
149    gene2           TIGLEFERKKGEAAMLDFTLADIDKARLIPHVDFRSRKQ
150                    .:***** **.:  :*.*** **::*:* * :**:.:*:
151    </pre>
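
The step from a gene to its protein sequence can be sketched in Python with the standard genetic code. The names `CODON_TABLE` and `translate` are assumptions for this example, and the sketch translates every codon literally, reading from the first character and stopping at the first stop codon.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA string codon by codon, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# The first 33 characters of gene1 from the alignment above.
print(translate("ATGGCTGATTTATTCGCATTGACCGAAGAAGCG"))
```

Applied to the start of gene1, this gives the first residues of the gene1 protein shown in the protein alignment.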
152    
153    There is a great deal of work relating to recognizing when two sequences are
154    similar and whether or not they had a common ancestor.  Understanding why
155    selective pressure conserves sections of sequences, but not others, will yield
156    important clues.  Can you reason out why some sections might be conserved, while
157    others vary wildly?
158    <p>
159    
160    Comparing sets of sequences that have retained the same function is
161    at the heart of understanding cellular machines and the proteins that implement them.
162    We find that looking at sets (often with more than two sequences) and aligning them
163    is important.
164    
 shows an alignment of two proteins (called <i>seq1</i> and <i>seq2</i>).  
165    
166  <h3> Given a set of sequences, align them in a way that minimizes some edit function.</h3>  <h3> Given a set of sequences, align them in a way that minimizes some edit function.</h3>
167    
# Line 170  Line 238 
238    
239  <h3> Given a multiple sequence alignment, determine the most likely evolutionary history of the sequences (i.e., construct a phylogenetic tree).</h3>  <h3> Given a multiple sequence alignment, determine the most likely evolutionary history of the sequences (i.e., construct a phylogenetic tree).</h3>
240    
241  Here is one reasonable tree for the last 5 sequences.  Note that we now have alignments that  From the extant five sequences that are similar and displayed in the previous alignment, we can construct
242  contain thousands of sequences, and even displaying such trees is nontrivial.  a tree that depicts the "phylogenetic history" of the sequences.
243    Here is one reasonable tree for the last 5 sequences.
244    
245  <pre>  <pre>
246                       ,--------------------------------------------------- seq1                       ,--------------------------------------------------- seq1
247                       |                       |
# Line 183  Line 253 
253    |    |
254    |    |
255    |    |
256    |    ,----|
   |  
   |             ,-------------------------------- seq3  
   |             |  
257    |             |    |             |
258    |-------------|    |    |             ,-------------------------------- seq3
259      |    |             |
260      |    |             |
261      |    |-------------|
262    |             |    |             |
263    |             |    |             |
264    |             `------------------------------ seq4    |             `------------------------------ seq4
# Line 197  Line 267 
267    `---------------------------------------------- seq5    `---------------------------------------------- seq5
268  </pre>  </pre>
269    
270  This is an <i>unrooted tree</i>, since we have no idea just looking at extant  The tree suggests that at some point an ancestral
271  sequences about where the root should lie.  cell replicated.  One copy led (through a chain of descendants) to <b>seq5</b>, while the remaining sequences descend
272    from the other copy.
273    <p>
274    Note that we now have alignments that
275    contain thousands of sequences, and even displaying such trees is nontrivial.
276    Because evolution plays such a central role in the phenomena we study, the construction of alignments
277    and trees in order to compare extant versions of proteins and gain insight into their historical origins
278    is considered basic to the task at hand.
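
One simple way to build a tree from an alignment can be sketched in Python. This is an illustration under stated assumptions, not necessarily how the tree above was produced: it measures the fraction of differing columns between each pair of aligned sequences and then repeatedly joins the two closest clusters (the UPGMA method). The names `p_distance` and `upgma` are hypothetical, and the sketch assumes every pair of sequences shares at least one gap-free column.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of gap-free aligned columns at which two sequences differ."""
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in columns) / len(columns)


def upgma(aligned):
    """Cluster aligned sequences (name -> sequence) into a nested-tuple tree."""
    # Each key is a subtree; its value is the list of sequence names under it.
    clusters = {name: [name] for name in aligned}
    dist = {
        frozenset(pair): p_distance(aligned[pair[0]], aligned[pair[1]])
        for pair in combinations(aligned, 2)
    }

    def average_distance(c1, c2):
        members1, members2 = clusters[c1], clusters[c2]
        total = sum(dist[frozenset((x, y))] for x in members1 for y in members2)
        return total / (len(members1) * len(members2))

    while len(clusters) > 1:
        # Join the two clusters with the smallest average pairwise distance.
        a, b = min(combinations(clusters, 2), key=lambda p: average_distance(*p))
        clusters[(a, b)] = clusters.pop(a) + clusters.pop(b)
    return next(iter(clusters))


toy = {"s1": "AAAAAAAA", "s2": "AAAAAAAT", "s3": "TTTTAAAA", "s4": "TTTTAACC"}
print(upgma(toy))
```

UPGMA assumes that sequences change at a roughly constant rate, which often does not hold; other tree-building methods relax that assumption.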
279    
280  <h2>Some Random Facts that You Should Absorb</h2>  <h2>Some Random Facts that You Should Absorb</h2>
281    
# Line 215  Line 292 
292  <li>What is the average length of a gene?  <li>What is the average length of a gene?
293  </ul>  </ul>
294  <br>  <br>
295  It is worth spending just a short bit of time thinking about what types of cellular  It is worth spending just a short bit of time thinking about what types of
296  machines must exist.  Here are a few thoughts to start with  machines must exist in each cell.  Here are a few thoughts to start with
297  <ul>  <ul>
298  <li>  <li>
299  There must be one or more machines that support replication of the cell.  You would  There must be one or more machines that support replication of the cell.  You would
# Line 238  Line 315 
315  to react to it.  For example, many cells can "swim" towards food.  to react to it.  For example, many cells can "swim" towards food.
316  </ul>  </ul>
317  Those were just a few examples.  For any cell, we have many, many machines, and we still  Those were just a few examples.  For any cell, we have many, many machines, and we still
318  do not even understand what some of them do.  do not even understand what some of them do.  Later, we will try to offer a more structured
319    estimate of what is already known.
320  <p>  <p>
321  About 50-60% of the genes occur within 5000 characters of another gene such that  About 50-60% of the genes occur within 5000 characters of another gene such that
322  the two genes encode proteins that are part f the same cellular machine.  If you  the two genes encode proteins that are part of the same cellular machine.  This fact
323  had a genome in which the genes were identified, but the correspondence between the encoded  suggests that just having a large number of genomes would enable a person to group
324  proteins and cellular machines was completely unknown, what could you learn using this fact?  the genes into the machines they implement, without the person understanding the functions
325  Is the situation significantly different if you have 1000 genomes (let us say that  of the machines or the roles played by each protein.
 you know where the genes occur, but the correspondence between the proteins and cellular machines  
 is completely unknown in each case).  
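
A sketch of how this clustering fact could be used: count, over many genomes, how often two gene families have members within 5000 characters of each other. The data layout (each genome as a list of `(family, start, end)` tuples), the function name `conserved_neighbors`, and the `min_genomes` threshold are all hypothetical. Deciding which genes in different genomes belong to the same family is itself the similarity problem discussed earlier.

```python
from collections import Counter
from itertools import combinations


def conserved_neighbors(genomes, max_gap=5000, min_genomes=2):
    """Return family pairs found within max_gap of each other in enough genomes."""
    counts = Counter()
    for genes in genomes:
        close = set()  # family pairs that are neighbors in this genome
        for (fam_a, start_a, end_a), (fam_b, start_b, end_b) in combinations(genes, 2):
            # Distance between the two intervals; negative if they overlap.
            gap = max(start_a, start_b) - min(end_a, end_b)
            if fam_a != fam_b and gap <= max_gap:
                close.add(frozenset((fam_a, fam_b)))
        counts.update(close)  # each genome votes at most once per pair
    return {tuple(sorted(pair)) for pair, n in counts.items() if n >= min_genomes}


genome1 = [("famA", 0, 900), ("famB", 1200, 2000), ("famC", 50000, 51000)]
genome2 = [("famB", 10000, 10800), ("famA", 11000, 11900), ("famC", 300, 1300)]
print(conserved_neighbors([genome1, genome2]))
```

A pair seen close together in one genome may be there by chance; a pair that stays close across many distantly related genomes is a much stronger hint that the two proteins belong to the same machine.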
326  <p>  <p>
327  Occasionally, proteins that are usually distinct in most cells are fused into a single protein in  Occasionally, proteins that are usually distinct in most cells are fused into a single protein in
328  a few cells.  In these cases, the fused gene is (by definition) part of a single machine, and  a few cells.  In these cases, the fused gene is (by definition) part of a single machine, and
329  in most cells in which the proteins are not fused, the two distinct proteins are separate components  in most cells in which the proteins are not fused, the two distinct proteins are separate components
330  of a single machine.  How wuld you go about locating fused genes, and what could you learn from them?  of a single machine.  This, too, offers clues to support analysis of which proteins go with which machines.
331  <p>  <p>
332  Biologists have figured out the roles of about 50% of the genes.  That is, they can  Biologists have figured out the roles of about 50% of the genes.  That is, they can
333  place the gene in a cellular machine, they know what the machine does, and they know  place the gene in a cellular machine, they know what the machine does, and they know
