Analysis of Variation Within a Set of Closely Related Genomes
Both the SEED and the SPROUT will include features designed to support
analysis of variation between closely related genomes. For the
record, we expect to see 20-100 closely related genomes for a number
of pathogens and production strains by early 2007. It is possible
that we will see such collections even earlier. Within a very
few more years it will be commonplace to see hundreds of closely
related strains. The driving force will be the plummeting costs of
sequencing and resequencing. The sole issue is how much can be
learned from such collections. If such efforts actually produce
better production strains, new antibiotics, or new vaccines, there is
no question that the sequencing costs can easily be justified.
Major and Minor Variation: the Basic Concepts
Consider a set of very closely related genomes. There will be regions
in which all of the genomes contain very nearly identical sequence for
long stretches, although each genome will contain insertions,
deletions and rearrangements. A block is a corresponding section from two or
more genomes that can easily be accurately aligned. Often the
sequence similarity is near 100%. An unaligned sequence is a
sequence from one of the genomes.
From an initial collection of similar genomes, we will
form a collection of blocks and unaligned sequences such that:
- each genome can be viewed as a sequence of entries from blocks and
unaligned sequences,
- if part of a gene occurs within a block, the whole gene occurs within
the block,
- each block contains at most one entry from any single genome, and
- whenever a block can be constructed (rather than leaving a collection
of unaligned sequences), it will be formed.
Note that these conditions are somewhat imprecise and do not, as
they stand, force a unique collection of blocks and unaligned
sequences. However, for our purposes, this is not critical. One of
the most serious ambiguities involves the notion of "accurately
aligned". On the one extreme, we might force blocks to be free of
indels. On the other, we might allow fairly substantial stretches of
indels (say, up to 20). The exact parameters that we use will be
important, but they represent a detail to be determined as we proceed.
Blocks in which one or more genomes are not represented and all
unaligned sequences are thought of as major variations.
Columns within a block that contain differing characters are thought
of as minor variations or SNPs.
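The two definitions above can be sketched in Python. This is only an illustration, assuming a block is stored as a mapping from genome ID to an aligned string, with all strings in a block having the same length. The function names, genome IDs, and sequences are hypothetical and are not part of the SEED or the SPROUT.

```python
def is_major_variation(block, all_genomes):
    """A block exposes a major variation if some genome has no entry in it."""
    return any(genome not in block for genome in all_genomes)


def snp_columns(block):
    """Return the offsets of columns whose characters differ between genomes.

    Assumes every entry in the block has the same aligned length. How
    columns containing indels should be treated is left open here, just
    as the exact alignment parameters are left open in the text.
    """
    rows = list(block.values())
    return [i for i, column in enumerate(zip(*rows)) if len(set(column)) > 1]


# Hypothetical data: three genomes, two blocks.
genomes = ["g1", "g2", "g3"]
block_a = {"g1": "ACGTAC", "g2": "ACGTAC", "g3": "ACTTAC"}
block_b = {"g1": "GGCC", "g2": "GGCC"}  # g3 is not represented

print(snp_columns(block_a))                 # [2]
print(is_major_variation(block_a, genomes)) # False
print(is_major_variation(block_b, genomes)) # True
```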
Finally, we can dispense with the concept of "unaligned sequence", if
we allow blocks containing a single entry. I think that it is useful
to leave the discussion above in terms of blocks and unaligned
sequences and to treat this merger as an implementation detail.
The SPROUT Entities and Relationships Needed to Support Variation
We will be using the SEED and the SPROUT to connect variation to
phenotype. The discussion so far supports the addition of the
following SPROUT entities and relationships:
- Block ContainsSubsequenceFrom Genome with three fields of
intersection data (Contig,Beg,End)
- MajorVariation IsExposedBy Block
- SNP IsMinorVariationIn Block with one field of intersection
data (Offset into the Block), and
- SNP HasCharacterIn Genome with one field of intersection
data (Value which is the character in the genome)
A key component of the analysis required to do this will
be to correlate variations with affected genes. This would argue for
the addition of the following relationships:
- Feature ParticipatesInMajorVariationWith MajorVariation,
- Feature ContainsSNP SNP,
- Feature HasUpStreamImpactedByMajorVariation MajorVariation,
- Feature HasDownStreamImpactedByMajorVariation MajorVariation,
- Feature HasUpStreamImpactedBySNP SNP, and
- Feature HasDownStreamImpactedBySNP SNP
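As a rough sketch of how the SNP relationships might be populated, the Python below picks a relationship name from a feature's location and a SNP position. It assumes the SNP's block offset has already been mapped to a coordinate on the same contig as the feature. The `Feature` class, the `classify_snp` function, and the 200-character `window` used to decide what counts as "upstream" or "downstream" are all hypothetical; the text does not fix any of them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Feature:
    fid: str
    beg: int     # leftmost coordinate on the contig (1-based, inclusive)
    end: int     # rightmost coordinate on the contig (1-based, inclusive)
    strand: str  # "+" or "-"


def classify_snp(feature: Feature, pos: int, window: int = 200) -> Optional[str]:
    """Name the relationship (if any) between a feature and a SNP position.

    `window` is an assumed cutoff for upstream/downstream impact.
    """
    if feature.beg <= pos <= feature.end:
        return "ContainsSNP"
    left = feature.beg - window <= pos < feature.beg
    right = feature.end < pos <= feature.end + window
    if not (left or right):
        return None
    # On the "+" strand the left flank is upstream; on "-" it is downstream.
    upstream = left if feature.strand == "+" else right
    return "HasUpStreamImpactedBySNP" if upstream else "HasDownStreamImpactedBySNP"


# Hypothetical features on one contig.
plus_gene = Feature("peg.1", 1000, 2000, "+")
minus_gene = Feature("peg.2", 1000, 2000, "-")
```

For example, `classify_snp(plus_gene, 900)` names the upstream relationship, while the same position relative to `minus_gene` names the downstream one. The same pattern would apply to major variations, using an interval rather than a single position.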
Relating Phenotype to Variation
The central goal of encoding variation in genomes will be to relate
these differences to differences in phenotype. To understand what is
needed requires that we first consider some typical biological
experiments. Then, when we consider how to encode the relationships
between phenotype and variation, whatever choices we select should at
least adequately handle these prototypical examples:
- Biologists frequently perform knockouts and then measure changes
between those organisms with and without the knockout. Sometimes this
is done for a single gene, and the result is summarized in a journal
article. Sometimes it is methodically done for large numbers of genes
and the changes being examined are things like "essentiality" or
"virulence". Sometimes the measurements record profiles of
expression under differing conditions for the two organisms.
- A closely related type of data is acquired when one or more genes
are "added to a genome" (either via actual integration or by the
addition of a plasmid), and measurements are taken comparing the
derived genome with the original genome.
- A third type of data amounts to taking a large number of
unsequenced organisms, recording the presence or absence of a specific
known variation, and attaching some measurement of phenotype.
- A final type of data amounts to taking a small set of strains from
the set of those that have been sequenced and acquiring numerous
measurements, each reflecting an aspect of phenotype.
The Issue of What Is Meant by a Genome?
The SEED and SPROUT model of a genome amounts to a collection of
contigs that cover at least 90-95% of the actual genome (at least this
is the SPROUT model -- this amounts to the SEED model of a complete
genome, which clearly includes those that are only nearly
complete). When people discuss knockouts or measurements against
hundreds of unsequenced clinical isolates, clearly some new issues are
appearing. How should these cases be handled?
Perhaps a good place to start, as in many things bacterial, is with
E. coli. We now have, thanks to the efforts of Mori's group in
Japan, a set of knockouts for a majority of the E. coli genes.
This amounts to a collection of thousands of distinct strains each
carefully constructed to include all but one gene from a sequenced
strain (i.e., each of the distinct strains is believed to be
essentially identical to a known ancestor, except for the disruption
of a single gene). Suppose that we have measured some property of
each of these thousands of strains. How should we represent the
results?
To begin with, we need to accept that we are dealing with thousands of
distinct genomes. They need to have distinct IDs and be treated as
different. On the other hand, we do not have the precise sequence of
these thousands of genomes. Rather, we think of them each as "the
common genome with gene X disrupted". That is, in most aspects they
are viewed as completely identical to the original strain.
We do have the option of basically generating the conceptual genome
(at least a pretty good approximation of it) and adding it to a
running version of the SEED or the SPROUT. However, there is another approach.
I propose adding a single entity (the recipe of how a
genome was constructed) and a number of
relationships to the SPROUT model:
- Genome WasConstructedBy DerivationChanges
- DerivationChanges IncludeAdditionOf Feature
- DerivationChanges IncludeDeletionOf Feature
- DerivationChanges WereAppliedTo Genome
In this case, the newly-generated Genome is not related to any
Features or Contigs. Rather, it is related to a recipe,
which has the necessary relationships to the original genome.
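The recipe idea can be illustrated with a small Python sketch. It assumes a recipe records the genome it was applied to plus sets of added and deleted feature IDs, and that the feature set of a derived genome is resolved on demand rather than stored. The class layout, the `features_of` helper, and the genome and gene names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DerivationChanges:
    applied_to: str                              # ID of the original genome
    additions: set = field(default_factory=set)  # IDs of features added
    deletions: set = field(default_factory=set)  # IDs of features deleted


def features_of(genome_id, base_features, recipes):
    """Resolve the feature set of a genome, following its recipe if it has one."""
    recipe = recipes.get(genome_id)
    if recipe is None:
        # A sequenced genome: its features are stored directly.
        return set(base_features[genome_id])
    parent = features_of(recipe.applied_to, base_features, recipes)
    return (parent - recipe.deletions) | recipe.additions


# Hypothetical example: one sequenced strain and one single-gene knockout.
base_features = {"K12": {"geneA", "geneB", "geneC"}}
recipes = {"K12_delB": DerivationChanges(applied_to="K12", deletions={"geneB"})}

print(sorted(features_of("K12_delB", base_features, recipes)))  # ['geneA', 'geneC']
```

Under this sketch, thousands of knockout strains cost one small recipe each, and no conceptual genome has to be generated or loaded.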
Back to the Issue of Connecting Phenotype to Variation
In my view, there is no single "right" way to encode phenotype;
there are many ways and none that I know of is clearly superior. I am
going to propose an approach which has the redeeming characteristic of
being both simple and general.
I will begin by thinking of a phenotype as no more than a 3-tuple that
can be attached as a tag to one or more genomes. The fields in the
3-tuple are:
- an arbitrary label,
- a single floating-point number, and
- a single text field.
For boolean attributes (such as essential or
gram-positive), the floating-point number will simply be 0 or
1. In other cases it will reflect a numeric value to be associated
with one or more genomes. That is, I have essentially reduced the
notion of phenotype to the quite different notion of a
measurement, and assume that any single measurement can apply
to an arbitrary set of genomes. In this case, we assume that
variations connected to the distinct genomes will ultimately explain
the differences in measurement. Sorting out the relevant causal
relationships becomes one of the central issues of this century.
However, we may wish to connect measurements with variation, as well.
That is, once we have catalogued the variations within a family of
closely-related genomes, we may wish to attach measurements to the
variations themselves. Essentially, we wish to be able to encode
statements of the form "of the set of X genomes that have variation Y,
P had the following measurement and N did not". This makes sense for
boolean measurements. Non-boolean properties would be handled (less
than elegantly) by dividing a range of potential values into discrete
"buckets" and then considering presence or absence of a value in a
bucket as a boolean property.
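The bucketing idea can be sketched as follows. This is a minimal illustration, assuming measurements are stored per (genome, label) pair and that a bucket is a half-open numeric range. The `Measurement` class mirrors the 3-tuple described earlier; the function names, the `growth_rate` label, and the values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    label: str
    value: float
    text: str = ""


def in_bucket(m, low, high):
    """Boolean view of a numeric measurement: does it fall in [low, high)?"""
    return low <= m.value < high


def tally(genomes_with_variation, measurements, label, low, high):
    """Return (P, N) for the genomes that have a given variation.

    P counts genomes whose measurement for `label` falls in the bucket,
    N counts those whose measurement does not. Genomes with no
    measurement for `label` are skipped.
    """
    p = n = 0
    for genome in genomes_with_variation:
        m = measurements.get((genome, label))
        if m is None:
            continue
        if in_bucket(m, low, high):
            p += 1
        else:
            n += 1
    return p, n


# Hypothetical data: four genomes carry the variation, three were measured.
measurements = {
    ("g1", "growth_rate"): Measurement("growth_rate", 0.8),
    ("g2", "growth_rate"): Measurement("growth_rate", 0.3),
    ("g3", "growth_rate"): Measurement("growth_rate", 0.9),
}
print(tally(["g1", "g2", "g3", "g4"], measurements, "growth_rate", 0.5, 1.0))  # (2, 1)
```

A boolean attribute is the special case where the value is 0 or 1 and the bucket contains only 1.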
These considerations cause me to propose the following additions to
the existing SPROUT model:
- Measurement, which will include fields encoding the
label, numeric value, and arbitrary text.
- Genome HasMeasurement Measurement
- Measurement CorrelatesWithMajorVariation MajorVariation
- Measurement CorrelatesWithSNP SNP
This appears to me to be a reasonable approach. I solicit comments.