The Role of Bioinformatics in Interpreting Genomes of Unicellular Organisms:


An Abstract View

by Ross Overbeek

What Is a Cell?

A cell is a bag (i.e., a volume enclosed by a membrane) that contains three types of things: compounds, cellular machines, and a genome.

By the term compound I refer to the normal notion of chemical compound.

A cellular machine is a set of proteins that together perform a function. This function is often to transform one set of compounds into another set. Some types of machines (transport machines) are used to move compounds into or out of the cell.

A protein is a string of amino acids (i.e., a string in the 20-character alphabet {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}).

A genome is a string of DNA bases (i.e., a string in the 4-character alphabet {A,C,G,T}).

A gene is a region in the genome that describes how to build a protein. The description is a sequence of 3-character codons. Each codon either corresponds to a single amino acid or is a stop codon. There are three stop codons: {TAA,TAG,TGA}. The genetic code is the table of correspondences between codons and amino acids:

Amino Acid    Codons
A GCT, GCC, GCA, GCG
C TGT, TGC
D GAT, GAC
E GAA, GAG
F TTT, TTC
G GGT, GGC, GGA, GGG
H CAT, CAC
I ATT, ATC, ATA
K AAA, AAG
L TTA, TTG, CTT, CTC, CTA, CTG
M ATG
N AAT, AAC
P CCT, CCC, CCA, CCG
Q CAA, CAG
R CGT, CGC, CGA, CGG, AGA, AGG
S TCT, TCC, TCA, TCG, AGT, AGC
T ACT, ACC, ACA, ACG
V GTT, GTC, GTA, GTG
W TGG
Y TAT, TAC
* TAG, TGA, TAA [Stop codons]
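
To make the idea of a gene as a description of a protein concrete, here is a minimal Python sketch that reads a gene three characters at a time and looks each codon up in the table above. The function name translate and the example gene are illustrative assumptions, not part of any particular tool, and the sketch ignores the reverse strand and any organism-specific variations in the code.

```python
# Genetic code transcribed from the table above: amino acid -> its codons.
CODONS_BY_AMINO_ACID = {
    "A": "GCT GCC GCA GCG",
    "C": "TGT TGC",
    "D": "GAT GAC",
    "E": "GAA GAG",
    "F": "TTT TTC",
    "G": "GGT GGC GGA GGG",
    "H": "CAT CAC",
    "I": "ATT ATC ATA",
    "K": "AAA AAG",
    "L": "TTA TTG CTT CTC CTA CTG",
    "M": "ATG",
    "N": "AAT AAC",
    "P": "CCT CCC CCA CCG",
    "Q": "CAA CAG",
    "R": "CGT CGC CGA CGG AGA AGG",
    "S": "TCT TCC TCA TCG AGT AGC",
    "T": "ACT ACC ACA ACG",
    "V": "GTT GTC GTA GTG",
    "W": "TGG",
    "Y": "TAT TAC",
    "*": "TAG TGA TAA",  # stop codons
}

# Invert the table so that each of the 64 codons maps to one symbol.
CODON_TO_AMINO_ACID = {
    codon: amino_acid
    for amino_acid, codons in CODONS_BY_AMINO_ACID.items()
    for codon in codons.split()
}


def translate(gene: str) -> str:
    """Translate a DNA string codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(gene) - 2, 3):
        amino_acid = CODON_TO_AMINO_ACID[gene[i:i + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# A made-up 15-character gene: four codons followed by a stop codon.
print(translate("ATGAAACTGTATTAA"))  # prints MKLY
```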



This minimal notion of a cell is enough to explain some of the central problems in bioinformatics:

Identify the genes within a genome

This problem simply involves taking a genome (a string of DNA) and locating the set of genes it contains. Does the existence of hundreds of genomes (genomes with at least some estimate of where the genes occur) affect how you might do this?
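
As a starting point for thinking about this problem, here is a deliberately naive Python sketch that scans the three forward reading frames for stretches that begin with ATG and run to the first stop codon. The assumptions are loud ones: only ATG is treated as a start, only one strand is examined, and min_codons is an arbitrary threshold chosen for illustration. It shows the shape of the problem rather than how genes are actually called.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: ATG is the only start codon considered


def find_orfs(genome: str, min_codons: int = 2):
    """Return (start, end, frame) for forward-strand candidates.

    A candidate begins at the first ATG seen in a frame and ends just
    after the next stop codon in that frame (end is exclusive).
    min_codons counts the codons before the stop codon.
    """
    genome = genome.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(genome) - 2, 3):
            codon = genome[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# A made-up genome fragment with one candidate gene embedded in it.
genome = "CCATGAAACTGTATTAAGG"
for start, end, frame in find_orfs(genome):
    print(start, end, frame, genome[start:end])
```

With hundreds of genomes in hand, one obvious refinement is to prefer candidates whose translations resemble proteins already identified elsewhere; the sketch above makes no attempt at that.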

Given two proteins, "align" them in a way that minimizes some edit function.

For example:


seq1            -------------------MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLT
seq2            MKIRVICGAPTPKPFIKIPMKYYSTNKQAPLASLEEAVVKGLASDKGLFMPMTIKPLPQE
                                   ** *. :.:   .*: :**.:**..::***:*  :  :.

seq1            EIDEMLKLDFVTRSAKILSAFIGDEIPQEILEERVRAAFAFPAPVANVESDVGCLELFHG
seq2            FYDEIENLSFREIAYRVADAFFGEDVPAETLKEIVYDTLNFDVPLVPVKENIYSLELFHG
                  **: :*.*   : :: .**:*:::* * *:* *  :: * .*:. *:.:: .******

seq1            PTLAFKDFGGRFMAQMLTHIA---GDKPVTILTATSGDTGAAVAHAFYGLPNVKVVILYP
seq2            PTLAFKDVGGRFMARLLGYFIRKEGRKQVNVLVATSGDTGSAVANGFLGVEGIHVYVLYP
                *******.******::* ::    * * *.:*.*******:***:.* *: .::* :***

seq1            RGKISPLQEKLFCTLGGNIETVAIDGDFDACQALVKQAFDDEELKVALGLNSANSINISR
seq2            KGKVSEIQEKQFTTLGRNITALEVDGTFDDCQALVKAAFMDQELNEQLLLTSANSINVAR
                :**:* :*** * *** ** :: :** ** ****** ** *:**:  * *.******::*

seq1            LLAQICYYFEAVAQLPQETRNQ-LVVSVPSGNFGDLTAGLLAKSLGLPVKRFIAATNVND
seq2            FLPQAFYYFYAYAQLKKAGRAENVVICVPSGNFGNITAGLFGKKMGLPVRRFIAANNKND
                :*.*  *** * *** :  * : :*:.*******::****:.*.:****:*****.* **
The display above shows an alignment of two proteins (called seq1 and seq2).
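
One way to make "minimizes some edit function" precise is the standard dynamic-programming approach to global alignment. The Python sketch below assumes the simplest possible edit function, a cost of 1 for each mismatch and 1 for each gap character; these costs and the function name align are assumptions for illustration. Practical tools score substitutions with tables that reflect how readily one amino acid replaces another, so this sketch would not reproduce the alignment shown above.

```python
def align(a: str, b: str, mismatch: int = 1, gap: int = 1):
    """Global alignment of a and b that minimizes total edit cost.

    Returns (cost, aligned_a, aligned_b), where '-' marks a gap.
    """
    n, m = len(a), len(b)
    # cost[i][j] is the minimal cost of aligning a[:i] with b[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = 0 if a[i - 1] == b[j - 1] else mismatch
            cost[i][j] = min(
                cost[i - 1][j - 1] + step,  # align a[i-1] with b[j-1]
                cost[i - 1][j] + gap,       # a[i-1] against a gap
                cost[i][j - 1] + gap,       # b[j-1] against a gap
            )

    # Walk back from the corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        step = mismatch
        if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
            step = 0
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + step:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return cost[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


# The opening characters of seq1 and of the aligned part of seq2.
total, top, bottom = align("MKLYNLKD", "MKYYSTNK")
print(total)
print(top)
print(bottom)
```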

Given a set of sequences, align them in a way that minimizes some edit function.

Here is an example of a multiple sequence alignment:

CLUSTAL W (1.83) multiple sequence alignment


seq3            -------------------MRYISTRGQAPALNFEDVLLAGLASDGGLYVPENLPRFTLE
seq4            -------------------MRYISTRGSAPTLSFEEVLLTGLASDGGLYVPESLPSFTSA
seq5            -------------------MNYISTRGAIAPIGFKDAVMMGLATDGGLLLPETIPALGRN
seq1            -------------------MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLT
seq2            MKIRVICGAPTPKPFIKIPMKYYSTNKQAPLASLEEAVVKGLASDKGLFMPMTIKPLPQE
                                   *.  . .      .: :.:  **..: ** .*  :  :

seq3            EIASWVGLPYHELAFRVMRPFVAGSIADADFKKILEETYGVFAHDAVAPLRQLNGNEWVL
seq4            ELEAMASLDYPSLAHRILLPFVEEAFTGEELREIIDDTYAVFRHSAVAPLVQLDHNQWVL
seq5            TLESWQSLSYQDLAFNVIS-LFADDIPAQDLKDLIDRSYATFSHPEITPVVEKDG-VYIL
seq1            EIDEMLKLDFVTRSAKILSAFIGDEIPQEILEERVRAAFAFP-----APVANVESDVGCL
seq2            FYDEIENLSFREIAYRVADAFFGEDVPAETLKEIVYDTLNFD-----VPLVPVKENIYSL
                       * :   : .:   :.   ..   :.. :  :         .*:   .     *

seq3            ELFHGPTLAFKDFALQLLGRLLDHVLAKRGER-VVIMGATSGDTGSAAIEGCRRCDNVDI
seq4            ELFHGPTLAFKDFALQLLGRLLDAILKRRGEK-VVIMGATSGDTGSAAIAGCERCENIDI
seq5            ELFHGPTLAFKDVALQLLGNLFEYLLKERGEK-MNIVGATSGDTGSAAIYGVRGKDKINI
seq1            ELFHGPTLAFKDFGGRFMAQMLTHIA---GDKPVTILTATSGDTGAAVAHAFYGLPNVKV
seq2            ELFHGPTLAFKDVGGRFMARLLGYFIRKEGRKQVNVLVATSGDTGSAVANGFLGVEGIHV
                ************.. :::..::  .    * : : :: *******:*.  .      :.:

seq3            FIMHPHNRVSEVQRRQMTTILGDNIHNIAIEGNFDDCQEMVKASFADQGFLK-GTRLVAV
seq4            FILHPHGRVSEVQRRQMTTLSAPTIHNLAIEGNFDDCQAMVKASFRDQSFLPDGRRLVAV
seq5            FILHPHGKTSPVQALQMTTVLDPNVHNIAARGTFDDCQNIVKSLFSDLPFKE-KYSLGAV
seq1            VILYPRGKISPLQEKLFCTLGG-NIETVAIDGDFDACQALVKQAFDDEELKV-ALGLNSA
seq2            YVLYPKGKVSEIQEKQFTTLGR-NITALEVDGTFDDCQALVKAAFMDQELNE-QLLLTSA
                 :::*:.: * :*   : *:   .:  :   * ** ** :**  * *  :      * :.

seq3            NSINWARIMAQIVYYFHAALQLG-APH-RSVAFSVPTGNFGDIFAGYLARNMGLPVSQLI
seq4            NSINWARIMAQIVYYFYAGLRLG-APH-RAAAYSVPTGNFGDIFAGYLASKMGLPVAQLM
seq5            NSINWARVLAQVVYYFYAYFRVA-ALFGQEVVFSVPTGNFGDIFAGYVAKRMGLPIRRLI
seq1            NSINISRLLAQICYYFEAVAQLPQETRNQ-LVVSVPSGNFGDLTAGLLAKSLGLPVKRFI
seq2            NSINVARFLPQAFYYFYAYAQLKKAGRAENVVICVPSGNFGNITAGLFGKKMGLPVRRFI
                **** :*.:.*  *** *  ::      .  . .**:****:: ** ..  :***: :::

seq3            VATNRNDILHRFMSGNRYDKDTLHPSLSPSMDIMVSSNFERLLFDLHGRNGKAVAELLDA
seq4            IATNRNDVLHRLLSTGDYARQTLEHTLSPSMDISVSSNFERLMFDLYERDGAAIASLMAA
seq5            LATNENNILSRFINGGDYSLGDVVATVSPSMDIQLASNFERYVYYLFGENPARVREAFAA
seq1            AATNVNDTVPRFLHDGQWSPKATQATLSNAMDVSQPNNWPR-VEELFR------------
seq2            AANNKNDIFYQYLQTGQYNPRPSVATIANAMDVGDPSNFAR-VLDLYGGS----------
                 *.* *: . : :  . :       ::: :**:  ..*: * :  *.

seq3            FKASGKLSVEDQRWTEARKLFDSLAVSDEQTCETIAEVYRSCGELLDPHTAIGVRAAREC
seq4            FDD-GDITLSDAAMEKARQLFASHRVDDAQTLACIADVWGRTEYLLDPHSAIGYAAATQP
seq5            LPTKGRIDFTEAEMEKVRDEFLSRSVNEDETIATIAAFHRETGYILDPHTAVGVKAALEL
seq1            -------------RKIWQLKELGYAAVDDETTQQTMRELKELGYTSEPHAAVAYRALRDQ
seq2            -------------HAAIAAEISGTTYTDEQIRESVKACWQQTGYLLDPHGACGYRALEEG
                                      .    : :                :** * .  *  :

seq3            RRSLSVPMVTLGTAHPVKFPEAVEKAGIGQAPALPAHLADLFEREERCTVLPNELAKVQA
seq4            GANTQTPWVTLATAHPAKFPDAIKASAVGTTAQLPVHLADLFERSEHFDVLPNDIAAVQR
seq5            VQDG-TPAVCLATAHPAKFAEAVVR-AVGFEPSRPTSLEGIEALPSRCDVLDADRDAIKA
seq1            LNPG-EYGLFLGTAHPAKFKESVEA-ILGETLDLPKELAERADLPLLSHNLPADFAALRK
seq2            LQPG-ETGVFLETAHPAKFLQTVES-IIGTEVEIPAKLRAFMKGEKKSLPMTKEFADFKS
                        : * ****.** :::    :*     *  *            :  :   .:

seq3            FVSQHGNRGKPL
seq4            FMSGHLGA----
seq5            FIEKKAL-----
seq1            LMMNHQ------
seq2            YLLGK-------
                 :  :

Given a multiple sequence alignment, determine the most likely evolutionary history of the sequences (i.e., construct a phylogenetic tree).

Here is one reasonable tree for the last 5 sequences. Note that we now have alignments that contain thousands of sequences, and even displaying such trees is nontrivial.
                     ,--------------------------------------------------- seq1
                     |
                     |
  ,------------------|
  |                  |
  |                  |
  |                  `---------------------------------------------- seq2
  |
  |
  |
  |
  |
  |             ,-------------------------------- seq3
  |             |
  |             |
  |-------------|
  |             |
  |             |
  |             `------------------------------ seq4
  |
  |
  `---------------------------------------------- seq5
This is an unrooted tree, since looking only at extant sequences gives us no idea where the root should lie.
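
To give a feel for how a tree can be derived from an alignment, here is a small Python sketch that measures the distance between two aligned sequences as the fraction of columns in which they differ, and then repeatedly joins the two closest clusters (average-linkage clustering, often called UPGMA). This is a swapped-in, simple technique, not necessarily how the tree above was produced: it implicitly places a root and assumes a roughly constant rate of change, which the text does not. The data is one block of the multiple alignment above, copied as shown; the function names are illustrative.

```python
from itertools import combinations

# One block of the multiple sequence alignment shown earlier.
ALIGNED = {
    "seq3": "ELFHGPTLAFKDFALQLLGRLLDHVLAKRGER-VVIMGATSGDTGSAAIEGCRRCDNVDI",
    "seq4": "ELFHGPTLAFKDFALQLLGRLLDAILKRRGEK-VVIMGATSGDTGSAAIAGCERCENIDI",
    "seq5": "ELFHGPTLAFKDVALQLLGNLFEYLLKERGEK-MNIVGATSGDTGSAAIYGVRGKDKINI",
    "seq1": "ELFHGPTLAFKDFGGRFMAQMLTHIA---GDKPVTILTATSGDTGAAVAHAFYGLPNVKV",
    "seq2": "ELFHGPTLAFKDVGGRFMARLLGYFIRKEGRKQVNVLVATSGDTGSAVANGFLGVEGIHV",
}


def p_distance(x: str, y: str) -> float:
    """Fraction of differing columns, skipping columns where either has a gap."""
    columns = [(a, b) for a, b in zip(x, y) if a != "-" and b != "-"]
    return sum(a != b for a, b in columns) / len(columns)


def average_linkage_tree(seqs):
    """Join the two closest clusters until one remains; return nested tuples."""
    # Each cluster key is a (sub)tree; its value lists the sequences it contains.
    clusters = {name: [name] for name in seqs}
    dist = {
        frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
        for pair in combinations(seqs, 2)
    }

    def cluster_distance(c1, c2):
        # Average distance over all pairs with one member in each cluster.
        pairs = [(a, b) for a in clusters[c1] for b in clusters[c2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        members = clusters.pop(c1) + clusters.pop(c2)
        clusters[(c1, c2)] = members
    return next(iter(clusters))


print(average_linkage_tree(ALIGNED))
```

For alignments with thousands of sequences, both the all-pairs distance table and the repeated search for the closest pair become expensive, which is one reason building and displaying such trees is nontrivial.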

Some Random Facts that You Should Absorb

Most genomes of bacteria contain between 400,000 and 12,000,000 characters. Normally, the genes in a genome cover about 90% of the genome. Normally, there is about one gene per 1000 characters in a bacterial genome.

So,


It is worth spending just a short bit of time thinking about what types of cellular machines must exist. Here are a few thoughts to start with:

Those were just a few examples. For any cell, we have many, many machines, and we still do not even understand what some of them do.

About 50-60% of the genes occur within 5000 characters of another gene such that the two genes encode proteins that are part of the same cellular machine. If you had a genome in which the genes were identified, but the correspondence between the encoded proteins and cellular machines was completely unknown, what could you learn using this fact? Is the situation significantly different if you have 1000 genomes (let us say that you know where the genes occur, but the correspondence between the proteins and cellular machines is completely unknown in each case)?
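
One way to explore these questions is sketched below in Python. Within a single genome, closeness alone is weak evidence, but if the same pair of genes turns up close together in many genomes, it becomes more plausible that the two proteins belong to the same machine. The sketch assumes that each gene has already been given a label saying which genes in different genomes correspond to one another (the hypothetical "family" field); producing such labels is itself a substantial problem. The genomes, coordinates, and names are all made up.

```python
from collections import Counter

MAX_GAP = 5000  # "within 5000 characters", taken from the text above


def close_pairs(genes, max_gap=MAX_GAP):
    """genes: list of (family, start, end) for one genome.

    Return the set of family pairs whose genes lie within max_gap
    characters of each other (end of one gene to start of the next).
    """
    ordered = sorted(genes, key=lambda gene: gene[1])
    pairs = set()
    for i, (fam1, _start1, end1) in enumerate(ordered):
        for fam2, start2, _end2 in ordered[i + 1:]:
            if start2 - end1 > max_gap:
                break  # later genes start even further away
            if fam1 != fam2:
                pairs.add(frozenset((fam1, fam2)))
    return pairs


def conserved_pairs(genomes, min_genomes=2):
    """Count, for each family pair, the genomes in which the pair is close."""
    counts = Counter()
    for genes in genomes.values():
        counts.update(close_pairs(genes))
    return {pair: n for pair, n in counts.items() if n >= min_genomes}


# Two made-up genomes; famA and famB are neighbors in both.
genomes = {
    "genome_1": [("famA", 100, 1000), ("famB", 1200, 2100), ("famC", 20000, 21000)],
    "genome_2": [("famB", 500, 1400), ("famA", 1500, 2400), ("famD", 9000, 9900)],
}
for pair, n in conserved_pairs(genomes).items():
    print(sorted(pair), n)
```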

Occasionally, proteins that are usually distinct in most cells are fused into a single protein in a few cells. In these cases, the fused gene is (by definition) part of a single machine, and in most cells in which the proteins are not fused, the two distinct proteins are separate components of a single machine. How would you go about locating fused genes, and what could you learn from them?
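
One possible approach to locating fusions, sketched here in Python: if two different proteins each match a separate, essentially non-overlapping region of one longer protein, the longer protein is a candidate fusion, and the two shorter proteins are candidates for components of a single machine. The input format (a list of matches with the region of the long protein each one covers), the overlap tolerance, and all protein names are assumptions made for illustration; in practice the matches would come from a sequence similarity search.

```python
from collections import defaultdict
from itertools import combinations


def fusion_candidates(hits, max_overlap=10):
    """hits: (component, long_protein, start, end) tuples, meaning that
    component matches positions start..end of long_protein.

    Return {long_protein: [(component_1, component_2), ...]} for pairs of
    different components that cover nearly disjoint regions of it.
    """
    by_target = defaultdict(list)
    for component, target, start, end in hits:
        by_target[target].append((start, end, component))

    found = defaultdict(list)
    for target, regions in by_target.items():
        for (s1, e1, c1), (s2, e2, c2) in combinations(sorted(regions), 2):
            overlap = min(e1, e2) - max(s1, s2)  # negative if disjoint
            if c1 != c2 and overlap <= max_overlap:
                found[target].append((c1, c2))
    return dict(found)


# Made-up matches: protA and protB cover different halves of fusedX,
# while protC and protD both cover the same region of plainY.
hits = [
    ("protA", "fusedX", 0, 250),
    ("protB", "fusedX", 255, 450),
    ("protC", "plainY", 0, 240),
    ("protD", "plainY", 5, 238),
]
print(fusion_candidates(hits))
```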

Biologists have figured out the roles of about 50% of the genes. That is, they can place the gene in a cellular machine, they know what the machine does, and they know the specific role of the gene in sustaining the functionality of the machine.

Imposing a Structure on Characterizing the Inventory

One central goal of bioinformatics is to support an accurate characterization of the cellular machinery for each cell. It is of major importance to biologists that we be able to support comparative analysis of cells. Perhaps the most important aspect of understanding cells relates to their origin in an evolutionary process. Cells have a long evolutionary history dating back billions of years. The machines we see in cells today arose in the past, so we expect to see many current cells using machinery that resembles what turns up in other cells. When we compare machines from different cells, they often look remarkably similar. On the other hand, those that had a common origin in a cell that existed billions of years in the past may now have versions that are not very similar. Modifications, optimizations, and insignificant alterations all combine to explore the space of operational possibilities for each type of machine. Hence, we need a framework for studying similarities and differences in the cellular machines and the proteins that implement them.

Here is a short formulation of one way to do this:



We do not actually know what machines are present in a cell. We are in the midst of a grand effort to clarify which are there and what they do. The formulation of subsystems as abstract machines, in which each row of the subsystem describes a specific cellular machine that is believed to be present, represents a way to maintain a collection of estimates or assertions.

A protein family is defined to be a set of proteins that implement the same functional roles and are similar over the entire lengths of the proteins.

We seek a situation in which each protein occurs in one or more subsystems and in a single protein family. The computational tasks imposed by such a goal are obvious:

States of the Cell

The notion of subsystem was introduced as an abstract machine -- that is, as an attempt to create a framework for understanding variations within specific cellular machines via a form of comparative analysis. In any specific cell, sets of specific cellular machines are switched on and off as units. That is, they are co-regulated. We will call such a set of co-regulated cellular machines a regulon (note that a regulon is often a set containing a single cellular machine). A state of a cell will be defined as the set of regulons that are operational at a point in time. Thus, a state amounts to the set of cellular machines that are operational at one instant.
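
These definitions translate directly into sets. The short Python sketch below represents a regulon as a set of machines and a state as a set of regulons, and recovers the machines that are operational in a given state. The regulon and machine names are placeholders invented for the example.

```python
# Hypothetical regulons: each names the machines it switches on and off together.
REGULONS = {
    "regulon_1": frozenset({"machine_A", "machine_B"}),
    "regulon_2": frozenset({"machine_C"}),  # a regulon with a single machine
    "regulon_3": frozenset({"machine_D", "machine_E"}),
}


def operational_machines(state, regulons=REGULONS):
    """A state is a set of regulon names; return the machines operational in it."""
    machines = set()
    for name in state:
        machines |= regulons[name]
    return machines


state = {"regulon_1", "regulon_2"}
print(sorted(operational_machines(state)))
# prints ['machine_A', 'machine_B', 'machine_C']
```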

If we think of a car as a bag of machines that interact to make it function, we might consider there to be a huge number of states. There are many very minor "machines" like the arm rest (or the radio, or the night light) that can be on or off. However, we can divide the states of a car into major groupings based on the status of some key "machines". For example, "off" (the state in which the engine is turned off and the car is parked) and "on" (the engine is running and the car is moving) might be viewed as a crude partitioning of the states into two "major states".

Similarly, I believe that we should think about major states of the cell as being determined by the functioning (or not) of a limited set of regulons. The determination of these regulons, the major states, and how transitions between them are managed are all now parts of the picture being filled in.

Microarrays

Microarrays are, for a given genome, two lists of genes that "changed expression levels" between two states of a cell. Basically, the first list contains genes that were "active" during the first state but not the second, and the second list contains genes that were "active" in the second but not the first. If a cellular machine utilizes protein X, X is in the first list, and X is used in only one cellular machine, then it would be reasonable to infer that the machine was active in the first state but not the second. If one knew the regulons for a specific cell, it would go a long way to support extraction of insights from these microarrays. On the other hand, if one had many, many microarrays, and if the specific cellular machines for the cell were known, then one could make substantial progress in uncovering the exact composition of the regulons that make up the cell.
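
Both inferences can be sketched in a few lines of Python. The first function applies the rule just stated: a machine is inferred to be active only in the first state if it uses a protein that is in the first list and is used by no other machine. The second groups machines whose on/off pattern is identical across many experiments, as candidate regulons. Requiring identical patterns is an idealization assumed here for simplicity, since real measurements are noisy, and all names and data are invented for the example.

```python
from collections import Counter, defaultdict


def machines_active_only_in_first(first_list, machines):
    """machines: {machine name: set of proteins it uses}.

    Return machines that use some protein X such that X is in the first
    list and X is used in only one machine.
    """
    usage = Counter(p for proteins in machines.values() for p in proteins)
    return {
        name
        for name, proteins in machines.items()
        if any(p in first_list and usage[p] == 1 for p in proteins)
    }


def candidate_regulons(profiles):
    """profiles: {machine name: tuple of on/off values, one per experiment}.

    Machines with identical profiles are grouped as one candidate regulon.
    """
    groups = defaultdict(set)
    for name, profile in profiles.items():
        groups[tuple(profile)].add(name)
    return list(groups.values())


# Protein Y is shared by two machines, so it says nothing on its own.
machines = {
    "machine_A": {"X", "Y"},
    "machine_B": {"Y", "Z"},
    "machine_C": {"W"},
}
print(machines_active_only_in_first({"X", "Y"}, machines))  # prints {'machine_A'}

profiles = {
    "machine_A": (1, 0, 1),
    "machine_B": (1, 0, 1),
    "machine_C": (0, 0, 1),
}
print(candidate_regulons(profiles))
```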