Analysis of Variation Within a Set of Closely Related Genomes

Both the SEED and the SPROUT will include features designed to support analysis of variation between closely related genomes. For the record, we expect to see 20-100 closely related genomes for a number of pathogens and production strains by early 2007. It is possible that we will see such collections even earlier. Within a very few more years it will be commonplace to see hundreds of closely related strains. The driving force will be the plummeting costs of sequencing and resequencing. The sole issue is how much can be learned from such collections. If such efforts actually produce better production strains, new antibiotics, or new vaccines, there is no question that the sequencing costs can easily be justified.

Major and Minor Variation: the Basic Concepts

Consider a set of very closely related genomes. There will be regions in which all of the genomes contain very nearly identical sequence for long stretches, although each genome will contain insertions, deletions and rearrangements. A block is a corresponding section from two or more genomes that can easily be accurately aligned. Often the sequence similarity is near 100%. An unaligned sequence is a sequence from one of the genomes. From an initial collection of similar genomes, we will form a collection of blocks and unaligned sequences such that:
Note that these three conditions are somewhat imprecise and do not, as they stand, force a unique collection of blocks and unaligned sequences. However, for our purposes, this is not critical. One of the most serious ambiguities involves the notion of "accurately aligned". On the one extreme, we might force blocks to be free of indels. On the other, we might allow fairly substantial stretches of indels (say, up to 20). The exact parameters that we use will be important, but they represent a detail to be determined as we proceed.

Blocks in which one or more genomes are not represented, together with all unaligned sequences, are thought of as major variations. Columns within a block that contain differing characters are thought of as minor variations or SNPs.
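The distinction between major and minor variation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Block class, the function names, and the genome IDs are hypothetical and are not part of the SEED or the SPROUT. A block is modelled as a set of equal-length aligned rows keyed by genome ID, with '-' marking a gap.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Block:
    """Aligned rows keyed by genome ID (hypothetical structure).

    All rows are assumed to have the same length; '-' marks a gap.
    """
    rows: dict[str, str]


def minor_variations(block: Block) -> list[int]:
    """Return the 0-based columns in which the aligned characters differ."""
    sequences = list(block.rows.values())
    return [i for i, column in enumerate(zip(*sequences)) if len(set(column)) > 1]


def is_major_variation(block: Block, all_genomes: set[str]) -> bool:
    """A block is a major variation if some genome is not represented in it."""
    return bool(all_genomes - set(block.rows))


# Hypothetical example data: three genomes, two blocks.
genomes = {"g1", "g2", "g3"}
full_block = Block({"g1": "ACGT", "g2": "ACGA", "g3": "ACGT"})
partial_block = Block({"g1": "TTGA", "g2": "TTGA"})

snp_columns = minor_variations(full_block)                      # column 3 differs
full_is_major = is_major_variation(full_block, genomes)         # all genomes present
partial_is_major = is_major_variation(partial_block, genomes)   # g3 is missing
```

Under this representation an unaligned sequence is simply a block with a single row, which is always a major variation whenever the collection holds more than one genome.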

Finally, we can dispense with the concept of "unaligned sequence" if we allow blocks containing a single entry. I think that it is useful to leave the discussion above in terms of blocks and unaligned sequences and to treat this merger as an implementation detail.

The SPROUT Entities and Relationships Needed to Support Variation

We will be using the SEED and the SPROUT to connect variation to phenotype. The discussion so far supports the addition of the following SPROUT entities and relationships:
A key component of the analysis required to do this will be to correlate variations with affected genes. This would argue for the addition of the following relationships:

Relating Phenotype to Variation

The central goal of encoding variation in genomes will be to relate these differences to differences in phenotype. To understand what is needed, we must first consider some typical biological experiments. Then, when we consider how to encode the relationships between phenotype and variation, whatever choices we make should at least adequately handle these prototypical examples:

The Issue of What Is Meant by a Genome?

The SEED and SPROUT model of a genome amounts to a collection of contigs that cover at least 90-95% of the actual genome (at least this is the SPROUT model -- this amounts to the SEED model of a complete genome, which clearly includes those that are only nearly complete). When people discuss knockouts or measurements against hundreds of unsequenced clinical isolates, clearly some new issues are appearing. How should these cases be handled?

Perhaps a good place to start, as in many things bacterial, is with E. coli. We now have, thanks to the efforts of Mori's group in Japan, a set of knockouts for a majority of the E. coli genes. This amounts to a collection of thousands of distinct strains each carefully constructed to include all but one gene from a sequenced strain (i.e., each of the distinct strains is believed to be essentially identical to a known ancestor, except for the disruption of a single gene). Suppose that we have measured some property of each of these thousands of strains. How should we represent the results?

To begin with, we need to accept that we are dealing with thousands of distinct genomes. They need to have distinct IDs and be treated as different. On the other hand, we do not have the precise sequence of these thousands of genomes. Rather, we think of them each as "the common genome with gene X disrupted". That is, in most aspects they are viewed as completely identical to the original strain. We do have the option of basically generating the conceptual genome (at least a pretty good approximation of it) and adding it to a running version of the SEED or the SPROUT. However, there is another approach. I propose adding a single entity (the recipe of how a genome was constructed) and a number of relationships to the SPROUT model:

In this case, the newly-generated Genome is not related to any Features or Contigs. Rather, it is related to a recipe, which has the necessary relationships to the original genome.
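To make the idea of a recipe concrete, here is a minimal Python sketch. It is an assumption-laden illustration and not the SPROUT schema: the class names, the field names (parent_genome_id, disrupted_feature_id), and all IDs are hypothetical. The point it shows is that a knockout genome carries no features or contigs of its own, and that its conceptual feature set can be derived on demand from the recipe and the original genome.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Recipe:
    """How a derived genome was constructed (hypothetical fields)."""
    parent_genome_id: str       # the sequenced ancestor
    disrupted_feature_id: str   # the single gene that was knocked out


@dataclass(frozen=True)
class Genome:
    genome_id: str
    recipe: Optional[Recipe] = None  # None for a genome with its own contigs


def conceptual_features(genome: Genome,
                        features_by_genome: dict[str, set[str]]) -> set[str]:
    """Return the feature IDs of a genome, following its recipe if it has one."""
    if genome.recipe is None:
        return set(features_by_genome[genome.genome_id])
    parent_features = features_by_genome[genome.recipe.parent_genome_id]
    return set(parent_features) - {genome.recipe.disrupted_feature_id}


# Hypothetical data: one sequenced parent and one knockout derived from it.
features_by_genome = {"parent": {"geneA", "geneB", "geneC"}}
parent = Genome("parent")
knockout = Genome("knockout-1", Recipe("parent", "geneB"))

knockout_features = conceptual_features(knockout, features_by_genome)
```

Only the parent's features are stored; thousands of knockout strains then cost one Genome and one Recipe each.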

Back to the Issue of Connecting Phenotype to Variation

In my view, there is no single "right" way to encode phenotype; there are many ways and none that I know of is clearly superior. I am going to propose an approach which has the redeeming characteristic of being both simple and general.

I will begin by thinking of a phenotype as no more than a 3-tuple that can be attached as a tag to one or more genomes. The fields in the tuple are

  1. an arbitrary label,
  2. a single floating-point number, and
  3. a single text field.
For boolean attributes (such as essential or gram-positive), the floating-point number will simply be 0 or 1. In other cases it will reflect a numeric value to be associated with one or more genomes. That is, I have essentially reduced the notion of phenotype to the quite different notion of a measurement, and assume that any single measurement can apply to an arbitrary set of genomes. In this case, we assume that variations connected to the distinct genomes will ultimately explain the differences in measurement. Sorting out the relevant causal relationships becomes one of the central issues of this century.
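As a minimal sketch of this 3-tuple, assuming Python and hypothetical names throughout (Measurement, boolean_measurement, and the genome IDs are illustrative only), a measurement and its attachment to a set of genomes might look like this:

```python
from __future__ import annotations

from typing import NamedTuple


class Measurement(NamedTuple):
    """The 3-tuple described above: label, floating-point value, text."""
    label: str
    value: float
    text: str


def boolean_measurement(label: str, flag: bool, text: str = "") -> Measurement:
    """Encode a boolean attribute as 1.0 (true) or 0.0 (false)."""
    return Measurement(label, 1.0 if flag else 0.0, text)


# One measurement may be attached to an arbitrary set of genomes.
tags: dict[Measurement, set[str]] = {}


def attach(measurement: Measurement, genome_ids: set[str]) -> None:
    """Tag each of the given genomes with the measurement."""
    tags.setdefault(measurement, set()).update(genome_ids)


# Hypothetical example: two genomes tagged as gram-positive.
gram_positive = boolean_measurement("gram-positive", True)
attach(gram_positive, {"g1", "g2"})
```

Because the tuple is immutable and hashable, it can serve directly as the key that groups the genomes sharing a measurement.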

However, we may wish to connect measurements with variation, as well. That is, once we have catalogued the variations within a family of closely-related genomes, we may wish to attach measurements to the variations themselves. Essentially, we wish to be able to encode statements of the form "of the set of X genomes that have variation Y, P had the following measurement and N did not". This makes sense for boolean measurements. Non-boolean properties would be handled (less than elegantly) by dividing a range of potential values into discrete "buckets" and then considering presence or absence of a value in a bucket as a boolean property.
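The bucketing idea can be sketched as follows. This is an illustration only, with hypothetical function names, bucket edges, and values: a numeric measurement is mapped to a bucket index, and membership in a chosen bucket is then treated as the boolean property, which yields the P and N counts for the genomes that carry a given variation.

```python
from __future__ import annotations

from bisect import bisect_right


def bucket_index(value: float, edges: list[float]) -> int:
    """Map a value to a bucket; sorted edges [e0, e1] give buckets
    (-inf, e0), [e0, e1), [e1, +inf) with indices 0, 1, 2."""
    return bisect_right(edges, value)


def count_in_bucket(genomes_with_variation: set[str],
                    values: dict[str, float],
                    edges: list[float],
                    bucket: int) -> tuple[int, int]:
    """Return (P, N): how many genomes with the variation fall inside
    the given bucket and how many do not."""
    p = sum(1 for g in genomes_with_variation
            if bucket_index(values[g], edges) == bucket)
    return p, len(genomes_with_variation) - p


# Hypothetical data: a numeric measurement on four genomes,
# three of which carry variation Y.
edges = [0.5, 1.0]
values = {"g1": 0.2, "g2": 0.7, "g3": 0.9, "g4": 1.4}
has_variation_y = {"g1", "g2", "g3"}

p, n = count_in_bucket(has_variation_y, values, edges, bucket=1)
```

Here, of the X = 3 genomes with variation Y, P = 2 have a value in the middle bucket and N = 1 does not. The choice of bucket edges is arbitrary, which is exactly the inelegance noted above.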

These considerations cause me to propose the following additions to the existing SPROUT model:

This appears to me to be a reasonable approach. I solicit comments.