The Role of Bioinformatics in Interpreting Genomes of Unicellular Organisms:

An Abstract View

by Ross Overbeek, ...


This strange document began as a tutorial for computer scientists and mathematicians. It was requested by an instructor in a computer class and was supposed to introduce them to the computational issues in genome analysis. Overbeek, in attempting to respond to this request, formulated an abstraction that he began to believe had significance beyond the tutorial.

This document is a set of working notes relating to the abstraction. It is not organized properly as an abstraction, a tutorial, or an essay on the role of bioinformatics in support of biological research. It is, however, organized properly as a working document that relates to all of these goals.

It begins with a development of the abstraction. This will be suitable for mathematicians or computer scientists. The abstraction is developed in three steps: the basic abstraction, the enhanced abstraction needed to provide basic bioinformatics support for biologists, and finally a third step that adds support for the notion of regulation. The intent throughout this discussion will be to seek a minimal set of concepts needed to effectively capture the essence of the required data. Unlike almost all efforts to lay a foundation for tutorials, software or research in biology, this effort focuses on leaving out as much as possible. While we do believe that there is an almost unlimited complexity that can be introduced, and almost all of it is needed for some specific goals, the vast majority of tools and discussions require (we believe) relatively few concepts. As they say, "the proof is in the pudding."

The second section will feature somewhat more tutorial commentary. It may well repeat much of what is in Part 1. This part is offered as a way of easing a computer scientist or mathematician into the issues that need to be considered, if they wish to try to do useful research relating to the genomics revolution. Eventually, this part will be dramatically expanded by giving condensed summaries of the machines of the cell broken into two broad sets: the metabolic network and the cellular machinery not directly included in the metabolic network. Loosely, this separates what would be learned in a microbial biochemistry class (when they exist) from what would be learned in a course on molecular biology.

The third part is an essay that attempts to characterize our view on the role of bioinformatics in supporting the genomic revolution.

As such, it is undoubtedly an arrogant formulation by a group of individuals with minimal background in biology.

The fourth section will focus on the implications of the abstractions for software development. This is a bit of a radical proposal that makes sense to us (and is in an area in which we can legitimately claim expertise).

Part 1: The Abstractions

The cell: a Minimal Perspective

A cell is a bag (i.e., a volume enclosed by a membrane) that contains three types of things: compounds, cellular machines, and a genome.

By the term compound we refer to the normal notion of chemical compound.

A cellular machine is a set of proteins that together perform a function. Unless otherwise noted, when we use the term machine we will always be speaking of a cellular machine. Many machines transform one set of compounds into another set. Some machines (transport machines) are used to move compounds into or out of the cell. Later we will try to convey a more comprehensive notion of what functions are implemented by machines that we understand.

A protein is a string of amino acids (i.e., a string in the 20-character alphabet {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}).

A genome is a string of DNA bases (i.e., a string in the 4-character alphabet {A,C,G,T}).

A gene is a region in the genome that describes how to build a protein. The description is a sequence of 3-character codons. Each codon corresponds to either a single amino acid or a stop codon. There are three stop codons: {TAA,TAG,TGA}. The genetic code is the table of correspondences between codons and amino acids:

Amino Acid Codons
* TAG, TGA, TAA [Stop codons]

The process of building a protein as a string of amino acids from the gene containing codons is called expressing the gene.
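
As a minimal sketch of what "expressing the gene" means in this abstraction, the following Python builds the standard genetic code as a lookup table and translates a gene codon by codon until it reaches a stop codon. The names GENETIC_CODE and express are illustrative only, and the sketch assumes that the gene is read in frame from its first base and contains only the characters A, C, G and T.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
# "*" marks the three stop codons TAA, TAG and TGA.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def express(gene: str) -> str:
    """Translate a gene (DNA string, read in frame from its first base) into a protein."""
    protein = []
    for i in range(0, len(gene) - 2, 3):
        amino_acid = GENETIC_CODE[gene[i:i + 3]]
        if amino_acid == "*":      # stop codon: the protein is complete
            break
        protein.append(amino_acid)
    return "".join(protein)


print(express("ATGGCTTGGTAA"))  # codons ATG GCT TGG TAA -> MAW
```
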
A subsystem (i.e., an abstract cellular machine) is a set of functional roles. Each protein implements one or more functional roles. The set of functional roles implemented by the protein is called the function of the protein. The function of a multifunctional protein that implements {functional-role-1,functional-role-2} is normally written as functional-role-1 / functional-role-2.

A populated subsystem is a subsystem with an attached spreadsheet. Each column in the spreadsheet corresponds to a functional role in the subsystem, and each row corresponds to a specific genome. Each cell in the spreadsheet contains the genes from the corresponding genome that implement the designated functional role (there may be 0 or more such genes).
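
One possible way to hold a populated subsystem in memory is sketched below: the functional roles are the columns, each genome is a row, and each cell is a (possibly empty) set of gene identifiers. The class name, method names, and all identifiers in the example are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PopulatedSubsystem:
    """A subsystem (a set of functional roles) plus its spreadsheet."""
    name: str
    roles: list                                  # one spreadsheet column per functional role
    rows: dict = field(default_factory=dict)     # genome -> {role -> set of gene ids}

    def assign(self, genome: str, role: str, gene: str) -> None:
        """Record that `gene` in `genome` implements `role`."""
        if role not in self.roles:
            raise ValueError(f"{role!r} is not a functional role of {self.name!r}")
        self.rows.setdefault(genome, {}).setdefault(role, set()).add(gene)

    def cell(self, genome: str, role: str) -> set:
        """Genes in one spreadsheet cell; empty when no gene implements the role."""
        return self.rows.get(genome, {}).get(role, set())


# Hypothetical data, for illustration only.
subsystem = PopulatedSubsystem("example-subsystem", ["functional-role-1", "functional-role-2"])
subsystem.assign("genome-A", "functional-role-1", "gene-17")
subsystem.assign("genome-A", "functional-role-1", "gene-42")
print(subsystem.cell("genome-A", "functional-role-1"))   # the two genes assigned above
print(subsystem.cell("genome-B", "functional-role-2"))   # an empty cell
```
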

We do not actually know what machines are present in a cell. We are in the midst of a grand effort to clarify which are there and what they do. Formulating subsystems as abstract machines, in which each row of the subsystem describes a specific cellular machine that is believed to be present, is a way to maintain a collection of estimates or assertions.

A protein family is defined to be a set of proteins that implement the same functional roles and are similar over the entire lengths of the proteins.

We seek a situation in which each protein occurs in one or more subsystems and in a single protein family.

In any specific cell, sets of specific cellular machines are switched on and off as units. That is, they are co-regulated. We will call such a set of co-regulated cellular machines a regulon (note that a regulon is often a set containing a single cellular machine). A state of a cell will be defined as the set of regulons that are operational at a point in time. Thus, a state amounts to the set of cellular machines that are operational at one instant.
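
Read literally, these definitions say that a regulon is a set of machines and a state is a set of regulons, so the machines operational in a state are just the union of the regulons in it. The short sketch below spells that out; the regulon and machine names are hypothetical.

```python
def operational_machines(state, regulons):
    """Return the machines that are operational in `state`.

    `state` is a set of regulon names; `regulons` maps each regulon name
    to the set of co-regulated cellular machines it contains.
    """
    machines = set()
    for regulon in state:
        machines |= regulons[regulon]
    return machines


# Hypothetical regulons, for illustration only.
REGULONS = {
    "regulon-1": {"machine-a", "machine-b"},
    "regulon-2": {"machine-c"},          # a regulon holding a single machine
    "regulon-3": {"machine-d"},
}
state = {"regulon-1", "regulon-3"}
print(sorted(operational_machines(state, REGULONS)))
# ['machine-a', 'machine-b', 'machine-d']
```
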

Microarrays are, for a given genome, two lists of genes that "changed expression levels" between two states of a cell. Basically, the first list contains genes that were "active" during the first state, but not the second; and the second list contains genes that were "active" in the second but not the first. If a cellular machine utilizes protein X, X is in the first list, and X is used in only one cellular machine, then it would be reasonable to infer that the machine was active in the first state, but not the second.
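
The inference rule in the last sentence can be written down directly. The sketch below assumes that each machine is given as the set of proteins it uses and that the first microarray list is given as a set of those same identifiers; the function name and the example data are hypothetical.

```python
from collections import Counter


def machines_active_only_in_first_state(first_list, machines):
    """Infer machines that were active in the first state but not the second.

    A machine is reported when it uses at least one protein that appears in
    `first_list` and that is used by no other machine.
    """
    # Number of machines that use each protein.
    usage = Counter(p for proteins in machines.values() for p in proteins)
    return {
        machine
        for machine, proteins in machines.items()
        if any(p in first_list and usage[p] == 1 for p in proteins)
    }


# Hypothetical machines: protein Y is shared, X and Z are not.
MACHINES = {"machine-1": {"X", "Y"}, "machine-2": {"Y", "Z"}}
print(machines_active_only_in_first_state({"X"}, MACHINES))   # {'machine-1'}
print(machines_active_only_in_first_state({"Y"}, MACHINES))   # set(): Y is ambiguous
```
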

The cell: the Enhanced Formulation Needed to Support Bioinformatics

In the enhanced abstraction, we need to loosen up some concepts. This loosened-up formulation represents a very minimal set of changes. These changes should be left out of the basic tutorial for computer scientists and mathematicians.

The cell: Adding the Concepts Needed to Discuss Transcriptional Regulation

In the final version of the abstraction, we add the minimal set of notions needed to support analysis of transcriptional regulation. An operon is a set of contiguous genes that are all on the same strand and are all co-regulated. We consider a gene that is not co-regulated with any adjacent genes to be an operon composed of just itself. A binding site is a small region of DNA (normally occurring a short distance ahead of an operon) that acts as a switch turning the operon "on" or "off". When a specific protein or expressed RNA, called a transcriptional regulator, binds the site, it flips the switch. One or more specific transcriptional regulators can bind a specific site, and each specific transcriptional regulator is associated with a set of sites. A given regulator binding at a given site always has the same effect (either activating or deactivating the operon), but which effect it has depends on the site-regulator pair.
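
A minimal sketch of these notions as data follows. It assumes that each binding site controls exactly one operon and that the fixed effect of each site-regulator pair is stored in a lookup table; the class, the function, and every name in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operon:
    name: str
    strand: str      # "+" or "-"; all genes of an operon share one strand
    genes: tuple     # contiguous, co-regulated genes (possibly just one)


def bind(operons_on, site, regulator, site_to_operon, effect):
    """Return the set of switched-on operons after `regulator` binds `site`.

    `effect` maps each (site, regulator) pair to "activate" or "deactivate";
    the outcome is fixed per pair, as in the abstraction.
    """
    operon = site_to_operon[site]
    if effect[(site, regulator)] == "activate":
        return operons_on | {operon}
    return operons_on - {operon}


# Hypothetical operons, sites and regulators, for illustration only.
op1 = Operon("operon-1", "+", ("gene-1", "gene-2", "gene-3"))
op2 = Operon("operon-2", "-", ("gene-9",))
site_to_operon = {"site-1": op1, "site-2": op2}
effect = {
    ("site-1", "regulator-R"): "activate",
    ("site-2", "regulator-R"): "deactivate",
}

on = {op2}
on = bind(on, "site-1", "regulator-R", site_to_operon, effect)
on = bind(on, "site-2", "regulator-R", site_to_operon, effect)
print(sorted(op.name for op in on))   # ['operon-1']
```
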

Part 2: Tutorial Notes

Notes for The Basic Abstraction

We will be speaking about organisms that are a single cell. At some point life began on earth. The single-celled organisms that we know of replicate producing copies of themselves that have genomes which usually have very, very similar content to that of the parent cell. Evolution is the process in which cells replicate with some alterations in their genomes, are subjected to selective pressure, and survive or not depending on many somewhat random factors. The makeup of cells (i.e., the genomes they contain and the machines that define what they are capable of doing) changes gradually (and sometimes not so gradually) as time passes.

The original life forms that existed billions of years ago have evolved into three broad categories of life forms. That is, the evolutionary process led to early divisions, and these led to three main categories of single-celled organisms. We call these three forms the archaea, the bacteria, and the eukaryotes. A majority of the organisms for which we have acquired complete genomes are from the bacteria, although the numbers are rapidly growing for all three domains.

The minimal notion of a cell is enough to explain some of the basic problems in bioinformatics:

Identify the genes within a genome

If we are to understand the contents of genomes, we will need to locate the genes that occur in each genome. This problem simply involves taking a genome (a string of DNA) and locating the set of genes it contains. In the case of bacteria and archaea, we know pretty well how to locate the genes. Once we have identified instances from many genomes, it becomes possible to recognize the genes in a new genome by just looking for things similar to those we already understand. The following problem is at the heart of recognizing when two genes are "similar".

Given two genes, "align" them in a way that minimizes some edit function.
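
One common way to make this problem concrete is global alignment by dynamic programming, in the style of Needleman and Wunsch. The sketch below assumes the simplest possible edit function, a cost of 1 for each mismatched column and for each gap character; these costs and the function name align are illustrative choices, and real tools use more elaborate scoring.

```python
def align(a, b, mismatch=1, gap=1):
    """Globally align strings a and b, minimizing total mismatch and gap cost.

    Returns (cost, aligned_a, aligned_b), where "-" marks a gap.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cost of aligning a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = cost[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            cost[i][j] = min(substitute, cost[i - 1][j] + gap, cost[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (
            0 if a[i - 1] == b[j - 1] else mismatch
        ):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return cost[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


total, row_a, row_b = align("ACGTTGCA", "ACGTGCA")
print(total)   # 1: a single gap is enough
print(row_a)
print(row_b)
# Mark identical columns with "*", as alignment displays usually do.
print("".join("*" if x == y else " " for x, y in zip(row_a, row_b)))
```
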

For example, here is what you see when you align two genes from distinct organisms:

[Alignment of the two gene sequences. A "*" beneath a column marks a position at which the two genes have the same base.]

The sequences are recognizably similar, and in fact implement exactly the same function in the two cells. If we align the protein sequences corresponding to these two genes, we get

[Alignment of the two protein sequences. A "*" marks a column of identical amino acids; ":" and "." mark columns in which the two amino acids differ but have strongly or weakly similar properties.]

There is a great deal of work relating to recognizing when two sequences are similar and whether or not they had a common ancestor. Understanding why selective pressure conserves sections of sequences, but not others, will yield important clues. Can you reason out why some sections might be conserved, while others vary wildly?

Comparing sets of sequences that have retained the same function is at the heart of understanding cellular machines and the proteins that implement them. We find that looking at sets (often with more than two sequences) and aligning them is important.

Given a set of sequences, align them in a way that minimizes some edit function.
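
The notes leave the "edit function" open. One widely used choice is the sum-of-pairs cost, which adds up the pairwise cost of every column of the alignment. The sketch below only scores a given alignment (finding the best one is a much harder problem) and assumes unit mismatch and gap costs; the function name and the example rows are hypothetical.

```python
from itertools import combinations


def sum_of_pairs_cost(rows, mismatch=1, gap=1):
    """Score a multiple alignment given as equal-length strings with "-" for gaps."""
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for column in zip(*rows):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue                  # two gaps facing each other cost nothing
            if x == "-" or y == "-":
                total += gap
            elif x != y:
                total += mismatch
    return total


print(sum_of_pairs_cost(["AC-T", "ACGT", "A-GT"]))   # 4
```
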

Here is an example of a multiple sequence alignment:

CLUSTAL W (1.83) multiple sequence alignment

[Alignment blocks for the five sequences, seq1 through seq5. Only part of the final block is recoverable:]

seq4            FMSGHLGA----
seq5            FIEKKAL-----
seq1            LMMNHQ------
seq2            YLLGK-------

Given a multiple sequence alignment, determine the most likely evolutionary history of the sequences (i.e., construct a phylogenetic tree).

From the extant five sequences that are similar and displayed in the previous alignment, we can construct a tree that depicts the "phylogenetic history" of the sequences. Here is one reasonable tree for the last 5 sequences.
                     ,--------------------------------------------------- seq1
  |                  |
  |                  |
  |                  `---------------------------------------------- seq2
  |             ,-------------------------------- seq3
  |             |
  |             |
  |             |
  |             |
  |             `------------------------------ seq4
  `---------------------------------------------- seq5
The tree suggests that at some point an ancestral cell replicated. One copy led (through a chain of descendants) to seq5, while the remaining sequences descend from the other copy.
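
To give a feel for how a tree can be computed from an alignment, here is a minimal sketch of UPGMA, one of the simplest distance-based methods. It is not necessarily the method used to draw the tree above, and it ignores branch lengths. The distance used (the fraction of differing columns, skipping gaps), the function names, and the toy distances are all assumptions made for illustration.

```python
from itertools import combinations


def p_distance(x, y):
    """Fraction of aligned, ungapped columns in which two rows differ."""
    pairs = [(a, b) for a, b in zip(x, y) if a != "-" and b != "-"]
    if not pairs:
        return 1.0
    return sum(a != b for a, b in pairs) / len(pairs)


def upgma(names, dist):
    """Build a tree as nested tuples by repeatedly merging the closest clusters.

    `dist` maps frozenset({name1, name2}) to the distance between two leaves.
    """
    clusters = {name: 1 for name in names}        # cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        for c in clusters:
            # Distance to the merged cluster is the size-weighted average.
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


# Toy distances, invented for illustration: a and b are close, c is distant.
toy = {frozenset(("a", "b")): 2, frozenset(("a", "c")): 6, frozenset(("b", "c")): 6}
print(upgma(["a", "b", "c"], toy))   # ('c', ('a', 'b'))
```

In practice the distances would come from the alignment itself, for example by applying p_distance to every pair of aligned rows.
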

Note that we now have alignments that contain thousands of sequences, and even displaying such trees is nontrivial. Because evolution plays such a central role in the phenomena we study, the construction of alignments and trees in order to compare extant versions of proteins and gain insight into their historical origins is considered basic to the task at hand.

Some Random Facts that You Should Absorb

Most genomes of bacteria contain between 400,000 and 12,000,000 characters. Normally, the genes in a genome cover about 90% of the genome. Normally, there is about one gene per 1000 characters in a bacterial genome.


It is worth spending just a short bit of time thinking about what types of machines must exist in each cell. Here are a few thoughts to start with:

Those were just a few examples. For any cell, we have many, many machines, and we still do not even understand what some of them do. Later, we will try to offer a more structured estimate of what is already known.

About 50-60% of the genes occur within 5000 characters of another gene such that the two genes encode proteins that are part of the same cellular machine. This fact suggests that just having a large number of genomes would enable a person to group the genes into the machines they implement, without the person understanding the functions of the machines or the roles played by each protein.
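
The kind of evidence described here can be gathered with very little machinery. The sketch below assumes each genome is a list of (gene, start, end) triples on a single contig, measures closeness as the distance from the end of one gene to the start of the next, and assumes each gene has already been assigned to a protein family. It then counts in how many genomes two families occur close together. All names and data are hypothetical.

```python
from collections import Counter


def close_pairs(genes, max_gap=5000):
    """Return pairs of gene ids whose genes lie within `max_gap` characters."""
    ordered = sorted(genes, key=lambda g: g[1])          # sort by start position
    pairs = []
    for i, (id_a, _start_a, end_a) in enumerate(ordered):
        for id_b, start_b, _end_b in ordered[i + 1:]:
            if start_b - end_a > max_gap:
                break                                    # later genes are farther still
            pairs.append((id_a, id_b))
    return pairs


def clustering_evidence(genomes, family_of, max_gap=5000):
    """Count, for each pair of families, the genomes in which they occur close together."""
    counts = Counter()
    for genes in genomes.values():
        seen = set()
        for a, b in close_pairs(genes, max_gap):
            pair = frozenset((family_of[a], family_of[b]))
            if len(pair) == 2:                           # skip pairs within one family
                seen.add(pair)
        counts.update(seen)
    return counts


# Hypothetical genomes and family assignments, for illustration only.
GENOMES = {
    "genome-A": [("a1", 100, 1000), ("a2", 1500, 2400), ("a3", 20000, 21000)],
    "genome-B": [("b1", 50, 900), ("b2", 1200, 2000)],
}
FAMILY_OF = {"a1": "fam1", "a2": "fam2", "a3": "fam3", "b1": "fam2", "b2": "fam1"}
evidence = clustering_evidence(GENOMES, FAMILY_OF)
print(evidence[frozenset(("fam1", "fam2"))])   # 2
```

Family pairs that are close together in many genomes are candidates for belonging to the same machine, even when the function of the machine is unknown.
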

Occasionally, proteins that are usually distinct in most cells are fused into a single protein in a few cells. In these cases, the fused gene is (by definition) part of a single machine, and in most cells in which the proteins are not fused, the two distinct proteins are separate components of a single machine. This, too, offers clues to support analysis of which proteins go with which machines.

Biologists have figured out the roles of about 50% of the genes. That is, they can place the gene in a cellular machine, they know what the machine does, and they know the specific role of the gene in sustaining the functionality of the machine.

One central goal of bioinformatics is to support an accurate characterization of the cellular machinery for each cell. It is of major importance to biologists that we be able to support comparative analysis of cells. Perhaps the most important aspect of understanding cells relates to their origin in an evolutionary process. Cells have a long evolutionary history dating back billions of years. The machines we see in cells today arose in the past, so we expect to see many current cells using machinery that resembles what turns up in other cells. When we compare machines from different cells they often look remarkably similar. On the other hand, those that had a common origin in a cell that existed billions of years in the past may now have versions that are not very similar. Modifications, optimizations, and insignificant alterations all combine to explore the space of operational possibilities for each type of machine. Hence, we need a framework for studying similarities and differences in the cellular machines and the proteins that implement them.

Here is a short formulation of one way to do this:

We do not actually know what machines are present in a cell. We are in the midst of a grand effort to clarify which are there and what they do. Formulating subsystems as abstract machines, in which each row of the subsystem describes a specific cellular machine that is believed to be present, is a way to maintain a collection of estimates or assertions.

A protein family is defined to be a set of proteins that implement the same functional roles and are similar over the entire lengths of the proteins.

We seek a situation in which each protein occurs in one or more subsystems and in a single protein family. The computational tasks imposed by such a goal are obvious:

States of the Cell

The notion of subsystem was introduced as an abstract machine -- that is, as an attempt to create a framework for understanding variations within specific cellular machines via a form of comparative analysis. In any specific cell, sets of specific cellular machines are switched on and off as units. That is, they are co-regulated. We will call such a set of co-regulated cellular machines a regulon (note that a regulon is often a set containing a single cellular machine). A state of a cell will be defined as the set of regulons that are operational at a point in time. Thus, a state amounts to the set of cellular machines that are operational at one instant.

If we think of a car as a bag of machines that interact to make it function, we might consider there to be a huge number of states. There are many very minor "machines" like the arm rest (or the radio, or the night light) that can be on or off. However, we can divide the states of a car into major groupings based on the status of some key "machines". For example, "off" (the state in which the engine is turned off and the car is parked) and "on" (the engine is running and the car is moving) might be viewed as a crude partitioning of the states into two "major states".

Similarly, I believe that we should think about major states of the cell as being determined by the functioning (or not) of a limited set of regulons. The determination of these regulons, the major states, and how transitions between them are managed are all now parts of the picture being filled in.


Microarrays are, for a given genome, two lists of genes that "changed expression levels" between two states of a cell. Basically, the first list contains genes that were "active" during the first state, but not the second; and the second list contains genes that were "active" in the second but not the first. If a cellular machine utilizes protein X, X is in the first list, and X is used in only one cellular machine, then it would be reasonable to infer that the machine was active in the first state, but not the second. If one knew the regulons for a specific cell, it would go a long way to support extraction of insights from these microarrays. On the other hand, if one had many, many microarrays, and if the specific cellular machines for the cell were known, then one could make substantial progress in uncovering the exact composition of the regulons that make up the cell.

We are just now reaching the point where we do, in fact, have hundreds of microarrays (each representing changes between two sampled states of the cell). Let us reflect on how one might use this data to uncover the regulons that are represented and how they relate to the major "states of the cell".

We might begin by trying to determine sets of genes from each subsystem that appear to "move together". More precisely, we want to arrive at sets of genes that perform a well-defined function and of which some subset almost always shows up in the microarrays as "moving together". If some of these genes occur in only a single subsystem, then it would be reasonable to think of them as signatures for the set of genes. The most natural way to do this would be to start with metabolic subsystems or, even better, with scenarios (discussed below), which are subsets of functional roles from a metabolic subsystem such that the subset is a connected set with well-defined inputs and outputs. We then wish to define discovery of the regulon sets associated with each condition as follows:

  1.  First, for each scenario define 

  2. Then define the set of regulons.  Each regulon is  a set of scenarios.  There is a cost cost_reg associated with the definition of each regulon.  This prevents the definition of numerous regulons all containing just one scenario.  If the penalty is set too high, only one regulon will be defined.  If it is set too low, then a large set of small regulons results.

  3. Finally, you need to define the set of regulons that were activated for each microarray and the set that were deactivated.

  4. Now, you compute a score for your decisions as score = P - M - (cost_reg * number_of_defined_regulons * number_of_microarrays) where

Notes for the Enhanced Abstraction

The process of expressing a gene amounts to using the gene to produce the functional component of a machine (a protein for a protein-encoding gene, and an RNA for an RNA-encoding gene). The process of expressing a protein-encoding gene, which takes a gene (a string of DNA formed by concatenating a sequence of regions from contigs) and produces a protein, is normally thought of as taking place in two steps. Transcription is the process of a specific machine moving along the contig and making a copy of the gene as RNA. This string of RNA is then translated by a separate machine. The machine that copies the gene into a string of RNA is called an RNA polymerase. The machine that translates the RNA into a protein, the ribosome, is made up of both protein and RNA components.
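
The first of the two steps can be sketched as follows. The sketch assumes that a gene is described by a list of regions (contig, start, end, strand) with 1-based inclusive coordinates, and that a region on the "-" strand is read as the reverse complement of the contig; these conventions, the function names, and the example contigs are assumptions made for illustration.

```python
# Translation table that swaps each base for its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gene_dna(contigs, regions):
    """Concatenate the regions of a gene into a single DNA string.

    `contigs` maps contig ids to DNA strings; each region is
    (contig_id, start, end, strand) with 1-based inclusive coordinates.
    """
    parts = []
    for contig_id, start, end, strand in regions:
        piece = contigs[contig_id][start - 1:end]
        if strand == "-":
            piece = piece.translate(COMPLEMENT)[::-1]    # reverse complement
        parts.append(piece)
    return "".join(parts)


def transcribe(dna):
    """Copy a gene into RNA: the same string with U in place of T."""
    return dna.replace("T", "U")


# Hypothetical contigs, for illustration only.
CONTIGS = {"contig-1": "AAATGGCCTAAGG", "contig-2": "TTAGGCCAT"}
forward = gene_dna(CONTIGS, [("contig-1", 3, 11, "+")])
reverse = gene_dna(CONTIGS, [("contig-2", 1, 9, "-")])
print(forward, reverse)        # ATGGCCTAA ATGGCCTAA
print(transcribe(forward))     # AUGGCCUAA
```
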

Machines can be made up of both protein and RNA components, although most machines are built from just proteins. Some of the most fundamental questions in biology relate to how life started and the steps required to gradually enrich the basic machinery to the point where this magnificent information storage and maintenance system based on DNA, RNA and proteins could have arisen. There is much that can be inferred by reasoning back from what we now observe and reasoning forward from the relatively little we know of what the early earth was like. One possible set of goals would be to first understand in detail the inventory of components we now see in life forms, composing something analogous to a CAD/CAM system describing life forms. Then, as a second step, to understand the sequence of transformations that led from some initial raw components to initial life forms to those we have seen and characterized.

The need to allow occasional "nonstandard" characters in protein sequences, and to loosen the correspondence between a gene and the characters in the protein sequence it can be used to build, results from the fact that evolution has produced the existing genetic codes and they continue to evolve (either converging or diverging depending on the outcome of basically random processes operating under selective pressure).

Notes on the Abstraction Extended to Support Regulation

There are two basically different regulatory mechanisms in the cell. In one, you have a metabolic network in which fluxes are tightly controlled by positive and negative feedback loops. This metabolic regulation occurs very rapidly. Transcriptional regulation occurs orders of magnitude more slowly. It is just this transcriptional regulation that we consider in this extension.

As the cell changes state, regulons are activated or de-activated by transcriptional regulators (either protein or RNA) binding to specific sites in the DNA. This model has the redeeming characteristic of simplicity. It is certainly the case that there are innumerable important issues that it disregards (e.g., regulation based on DNA packaging, regulation due to small RNAs binding the RNAs produced by transcription, etc.). In forming any clear notion of transcriptional regulation and how it is achieved, we will need to carefully separate these different mechanisms, since they have fundamentally different modes of control and operation. We are arguing that the notion of a protein or RNA being used to flip regulons on and off by binding to control sites within the genome is a major form of regulation and probably the right place to start any effort to formulate a useful abstraction.

The Role of Bioinformatics in Supporting the Genomic Revolution

Within the growing genomics revolution, one can easily divide developments and goals into those relating to advances in medicine and agriculture and those relating to pure science. Here we consider only issues relating to pushing advances in basic research. Here is an overview of our perspective:
  1. The different life forms that now exist were produced by an evolutionary process, which leads to our view that comparative analysis is the key to understanding. Biological machines that exist in complex forms will often also still exist in simpler forms (usually in simpler organisms).
  2. Unravelling exactly how a machine works is more easily done in simpler organisms. They are easier to work with, and it is easier to gather the data needed to support comparative analysis.
  3. This leads to the view that we should try to understand single-celled organisms to lay the foundation for analysis of multicellular organisms.
  4. The characterization of unicellular life will require access to orders of magnitude more data than exist now (we have more-or-less complete genomes for about 1000 organisms, but that represents a small fraction of a percent of extant single-celled life forms).
  5. The immediate basic steps that are taking place are roughly:

    1. Attempt to formulate a growing list of abstract machines that correspond to the many specific machines that implement the same goal. These abstract machines (subsystems) represent the basic units that make up life forms.
    2. Create protein and RNA families in which the members are all homologous (share a common ancestor), remain similar over almost all of the sequence, and all implement a common function.
    3. Build alignments for each protein family, along with phylogenetic trees that represent an estimate of the history of how these specific sequences evolved.
    4. Provide a computational framework to support continued maintenance and development of these basic data types.

    Groups are now actively pursuing all of these goals.  For individuals wishing to build a research program, we suggest collaborating with an existing group or moving to one of the newer areas that are now emerging.

  6. A limited number of groups have progressed to the point where they can create models of an organism that display predictive capabilities. There are many forms of modeling. In our view it is important that we reach the state where we can routinely model states of the cell, transitions between states, and metabolic characteristics of the cell. We believe that it is now possible to create fairly comprehensive representations of the metabolic networks of some bacteria. In these cases, we have substantial amounts of physiological data, the number of abstract machines in the cell is fairly limited, and it is possible to compare the predictions against observed results. An effort has begun by a team within the SEED project, led by researchers from Hope College, to develop a library of what they call scenarios. These scenarios capture the idea of a specific machine implementing a metabolic transformation operating with well-defined inputs and outputs. From a large and growing number of scenarios in this library, they automatically reconstruct metabolic networks for most of the bacteria for which genomes have been sequenced. This effort is setting the stage for widespread whole genome metabolic modeling.
  7. Rapid progress has been made in our ability to recognize regulatory binding sites and to use them with knowledge of specific machines to create a consistent picture of regulons in some bacteria. This technology has been gathering adherents over the last five years, and we believe that it will play a significant role in clarifying regulons, identifying additional proteins that will be added to specific machines, and growing our understanding of states of the cell.

Having said all that, is it possible to list some of the important, high-payout bioinformatic questions that are worth pondering? Here is a list for your consideration:

  1. The definition of the location of genes for bacterial genomes needs cleaning up. The situation is made somewhat more interesting by a growing use of sequencing technologies that produce systematic errors leading to numerous frameshifts and poorly called start locations. Fixing these would be a problem of modest difficulty and very modest reward. The situation in eukaryotic genomes is quite different. The problem of defining the genes in a eukaryotic genome is still quite unsolved. We conjecture that
  2. The creation of populated subsystems is essentially a task for expert biologists.  However, the tools to support the task are a reasonable focus for bioinformatic projects.  The tools needed to delicately separate the roles of paralogous proteins have been illustrated in the works of Jensen and Bonner, among others.  These tools relate to use of alignments, trees and motifs to define the decision procedures needed to classify proteins into one of several closely-related families.
  3. The development of a self-consistent set of protein families is a task closely related to the one above. At this point there are several major efforts building such protein families. The development of protocols for maintenance of the families, studying the evolutionary history of related families, development of motifs that characterize specific families, and so forth all represent parts of a large classification problem.
  4. There is a class of tools that attempt to spot functional coupling between specific proteins. Some are bioinformatic (like the chromosomal clustering and fusion phenomena briefly discussed above). Some are essentially experimental data (e.g., protein-protein interaction data or microarray data). The integration of evidence into a system capable of predicting whether or not two specific proteins are both components of a single machine has been attempted, but much more remains to be done. The closely-related problem of determining whether or not two protein families are functionally coupled (and precisely what that means) should be considered simultaneously.
  5. Defining regulons by gradually composing a consistent interpretation of subsystems, regulatory sites, and physiological data is a task that is semi-automated. Development of a fully automated version seems too ambitious, but developing tools to increase the productivity of biologists developing these models of transcriptional regulation is certainly going to gain much more attention.
  6. Development of a meaningful notion of states of a cell is a problem that seems to us to have many of the characteristics one wants: it is a problem for which relevant data is starting to appear, many aspects of the needed infrastructure have only recently appeared, and the outcome may be of fundamental significance.
  7. To what extent is it possible to predict the protein families which have instances in a given cell given the closest 10 neighboring genomes and detailed information on the families they contain?
  8. Is it possible to think of a set of protein families as major predictors that would allow you to infer the presence or absence of many other families?

The Role of Abstraction in Setting the Stage for Software Development and Modeling

In this section, we argue that the abstraction is much more than just a pedagogical aid.  It will form the conceptual under-pinnings of the software needed to support work on the problems described in the last section (as well as numerous others that will become apparent as the revolution progresses).